HP ProLiant DL380p Avoiding server downtime from hardware errors in system mem

Introduction

IT administrators often measure the resiliency of a server by its reliability, availability, and
serviceability (RAS). The reliability of a server is its ability to avoid errors and faults. The availability of
a server is its ability to continue operating in spite of failures or errors. The serviceability of a server is
its capability to self-heal and the ease with which it can be maintained. Memory RAS is key to defining
the resiliency of a server in dealing with hardware errors. HP Memory Quarantine provides a
self-healing RAS capability that lets a server deal with errors that would otherwise result in system crashes.
This paper describes how the HP Memory Quarantine mechanism works and what it can achieve for
IT administrators.
To maximize your return on investment (ROI) in the data center, you should load your servers as heavily
as possible. Running multiple applications or setting up a physical server to operate as a number of
virtual machines (VMs) requires a server configured with large amounts of memory.
Errors in memory can and do occur. Servers routinely correct single-bit and some multi-bit errors with
various error-correction schemes. But uncorrectable errors, such as some multi-bit errors or hardware
failures, can crash a server and render it unavailable until you service it. As a result, using servers with
large memory footprints to maximize your ROI may actually increase the risk of unscheduled downtime
across a number of applications, virtual machines, or both.
Memory RAS techniques have evolved over time through the following mechanisms:
• Error Correction Code (ECC): A scheme that handles minor amounts of data corruption through
single-bit error correction and multi-bit error detection. Depending on the system, ECC can help
identify failing memory modules. Advanced ECC can detect and correct some multi-bit errors.
• Single Device Data Correct (SDDC), also known as Single Device Disable Code: A method that
uses an error-correction code to identify and disable a single DRAM device on an x4- or x8-width DIMM.
• Online memory sparing, also known as DIMM Sparing or Rank Sparing: A mode that protects
against persistent DRAM failure. Sparing monitors for an excessive number of correctable errors
and, when it detects them, copies the contents of the unhealthy portion of memory to an
available spare portion. Sparing can operate at the DIMM level or the rank level. It reduces the total
amount of memory by the amount of memory reserved for sparing. Sparing can handle only one
failure per DIMM.
• Memory Mirroring: A mode that uses two memory channels to transfer duplicate data. An
uncorrectable error detected during a read from one channel causes the system to retrieve the data
from the other channel. This mode provides increased protection from errors not corrected by ECC,
SDDC, and memory sparing. However, it reduces the total amount of memory available to the
operating system by 50% and is limited to two-channel operation.
• Lockstep: A mode that extends SDDC capability from x4 DRAM devices to x8 devices by using two
memory channels as a single wide channel to transfer a longer data word. Each long data word is
transferred with 16 redundant bits, which provide 8-bit error detection and 8-bit error
correction to protect against a single DRAM failure. While Lockstep mode does not reduce the total
amount of available memory, it degrades performance.
• Double Device Data Correction (DDDC), also known as Double-Chip Sparing: A mode that is
similar to memory sparing but is more robust. It can correct both single and double DRAM device
hardware errors for x4 DIMMs. By reserving one DRAM device in each rank as a spare, it ensures
data availability even after hardware failures in any two x4 DRAM devices (although not
simultaneously). Currently, Intel Xeon E7-based systems with specific firmware support DDDC.
• HP Memory Quarantine: The method (and the subject of this document) that identifies memory
regions containing uncorrectable hardware errors prior to data processing.
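
To make the basic ECC behavior in the list above more concrete (single-bit error correction and multi-bit error detection), the following Python sketch implements a toy SECDED code: a Hamming(7,4) code plus one overall parity bit. It is an illustration only and is not the code used by any HP or Intel memory controller; server ECC typically protects much wider words (for example, 64 data bits with 8 check bits). The function names encode and decode and the bit layout are assumptions made for this example.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (toy example).

    Index 0 holds the overall parity bit; indexes 1..7 are the classic
    Hamming(7,4) positions, with check bits at positions 1, 2, and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the total parity even
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    # The syndrome is the XOR of the positions of all set bits; it is 0 for
    # a valid codeword and equals the flipped position after a single error.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2  # 1 means an odd number of bits flipped

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Single-bit error: flip the bit the syndrome points at
        # (syndrome 0 means the overall parity bit itself was hit).
        code = list(code)
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        # The error is detected but cannot be corrected.
        return None, "uncorrectable"

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1                 # inject a single-bit error
    print(decode(word))
    word[2] ^= 1                 # inject a second error in the same word
    print(decode(word))
```

The "uncorrectable" result in this sketch corresponds to the class of error that can crash a server: the hardware knows the data is bad but cannot repair it. The later mechanisms in the list (SDDC, sparing, mirroring, Lockstep, DDDC, and HP Memory Quarantine) address failures that go beyond what this basic scheme can correct.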