HP ProLiant DL380p Avoiding server downtime from hardware errors in system mem - Page 2

Introduction

IT administrators often measure the resiliency of a server by its reliability, availability and

serviceability (RAS). The reliability of a server is its ability to avoid errors and faults. The availability of

a server relates to its continued operation in spite of failures or errors. The serviceability of a server is

its capability to self-heal and the ease with which it can be maintained. Memory RAS is key to defining

the resiliency of a server in dealing with hardware errors. HP Memory Quarantine provides self-

healing RAS capability for a server to deal with errors that would otherwise result in system crashes.

This paper describes how the HP Memory Quarantine mechanism works and what it can achieve for

IT administrators.

To maximize your return-on-investment (ROI) in the data center, you should task your servers as much

as possible. Running multiple applications or setting up a physical server to operate as a number of

virtual machines (VMs) requires a server configured with large amounts of memory.

Errors in memory can and do occur. Servers routinely correct single-bit and some multi-bit errors with

various error-correction schemes. But uncorrectable errors, such as some multi-bit errors or hardware failures,

can crash a server and render it unavailable until you service it. Using servers with large memory

footprints to maximize your ROI may actually increase the risk of unscheduled downtime across a

number of applications, virtual machines, or both.
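
The capacity-versus-risk trade-off above can be illustrated with a short calculation. This is a minimal sketch, not vendor data: it assumes each DIMM fails independently with the same probability over some period, and both the function name p_any_uncorrectable and the example probability are hypothetical values chosen only for illustration.

```python
def p_any_uncorrectable(p_per_dimm: float, dimm_count: int) -> float:
    """Chance that at least one DIMM sees an uncorrectable error.

    Assumes (hypothetically) that every DIMM has the same independent
    probability p_per_dimm of an uncorrectable error in the period.
    """
    if not 0.0 <= p_per_dimm <= 1.0:
        raise ValueError("p_per_dimm must be a probability between 0 and 1")
    if dimm_count < 0:
        raise ValueError("dimm_count must not be negative")
    # Complement of "no DIMM fails": 1 - (1 - p)^n
    return 1.0 - (1.0 - p_per_dimm) ** dimm_count


if __name__ == "__main__":
    # 0.001 is an invented per-DIMM probability, used only to show the trend.
    for count in (4, 8, 16, 24):
        print(count, "DIMMs:", p_any_uncorrectable(0.001, count))
```

Under these assumptions the server-level risk grows with every DIMM added, which is why a larger memory footprint makes memory RAS features more important.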

Memory RAS techniques have evolved over time through the following mechanisms:

• Error Correction Code (ECC): A scheme that handles minor amounts of data corruption through
single-bit error correction and multi-bit error detection. Depending on the system, ECC can help
identify failing memory modules. Advanced ECC can detect and correct some multi-bit errors.

• Single Device Data Correct (SDDC), also known as Single Device Disable Code: A method that
uses an error-correction code to identify and disable a single DRAM device on an x4- or x8-width DIMM.

• Online memory sparing, also known as DIMM Sparing or Rank Sparing: A mode that protects
against persistent DRAM failure. Sparing monitors for an excessive number of correctable errors
and, when it detects them, copies the contents of the unhealthy portion of memory to an
available spare portion. Sparing can operate at the DIMM level or the rank level. It reduces the
total amount of usable memory by the amount reserved for sparing. Sparing can only handle one
failure per DIMM.

• Memory Mirroring: A mode that uses two memory channels to transfer duplicate data. An
uncorrectable error detected during a read from one channel will cause the system to retrieve data
from the other channel. This mode provides increased protection from errors not corrected by ECC,
SDDC, and memory sparing. But it reduces the amount of total memory available to the operating
system by 50% and is limited to two-channel operation.

• Lockstep: A mode that extends SDDC capability from x4 DRAM devices to x8 devices by using two
memory channels as a single wide channel to transfer a longer data word. Each transfer of the long
data word uses 16 redundant bits to provide 8-bit error detection and 8-bit error correction,
which protects against a single DRAM failure. While Lockstep mode does not reduce the total
amount of available memory, it degrades performance.

• Double Device Data Correction (DDDC), also known as Double-Chip Sparing: A mode that is
similar to memory sparing but is more robust. It can correct both single and double DRAM device
hardware errors for x4 DIMMs. By reserving one DRAM device in each rank as a spare, it ensures
data availability even after hardware failures in any two x4 DRAM devices (albeit not
simultaneously). Currently, Intel Xeon E7-based systems with specific firmware support DDDC.

• HP Memory Quarantine: The method (and the subject of this document) that identifies memory
regions containing uncorrectable hardware errors prior to data processing.
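
The single-bit correction and multi-bit detection behavior described for ECC in the list above can be sketched with a toy code. This example is an illustration only: it uses an extended Hamming (8,4) code on 4 data bits, whereas memory controllers use much wider codes, and the function names encode and decode are hypothetical, not part of any HP or Intel interface.

```python
def encode(nibble: int) -> int:
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword."""
    data = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0: overall parity; 1..7: Hamming positions
    code[3], code[5], code[6], code[7] = data
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions with bit 0 set
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions with bit 1 set
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions with bit 2 set
    code[0] = sum(code[1:]) & 1             # makes the total number of 1 bits even
    return sum(bit << i for i, bit in enumerate(code))


def decode(word: int):
    """Return (status, nibble); nibble is None when the error is uncorrectable."""
    code = [(word >> i) & 1 for i in range(8)]
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of set positions points at a flipped bit
    odd_parity = sum(code) & 1          # 1 means an odd number of bits flipped
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        code[syndrome] ^= 1             # single-bit error (syndrome 0 is the parity bit)
        status = "corrected"
    else:
        return "uncorrectable", None    # two flipped bits: detected, not correctable
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    print(decode(word))                 # clean read
    print(decode(word ^ 0b00100000))    # one flipped bit
    print(decode(word ^ 0b00100100))    # two flipped bits
```

The "uncorrectable" result is the case the later mechanisms in the list address: ECC alone can report such an error but cannot repair it.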

Section	Page
Introduction	2
Error recovery with HP Memory Quarantine	3
Application isolation	4
Virtual machine isolation	4
Error logging	5
OS or hypervisor support	6
Hardware support	6
Conclusion	6
