HP ProLiant DL380p Avoiding server downtime from hardware errors in system mem - Page 3

Error recovery with HP Memory Quarantine

HP Memory Quarantine allows a server to recover from uncorrectable memory errors that would otherwise
crash the system. It is supported on HP ProLiant servers that use Intel® Xeon® 6500 and 7500 series
or E7 family processors and run an MCA Recovery-aware operating system (OS) or hypervisor.
HP Memory Quarantine uses firmware and the Machine Check Architecture (MCA) Recovery function of
select Intel Xeon processors to detect memory errors before they affect application operation. This
allows the system to recover from faulty memory and continue running without downtime. Previously
available only on Intel Itanium®-based systems, HP Memory Quarantine is now supported on several
HP ProLiant servers: the DL580 G7, DL980 G7, BL620c G7, and BL680c G7, when they use Intel Xeon
6500 or 7500 series or E7 family processors and run software that supports MCA Recovery.

The MCA Recovery function uses a hardware-based patrol scrubber that continuously checks system
memory for errors. When the scrubber detects an uncorrectable error, the processor generates a
machine check exception. The HP Memory Quarantine firmware identifies the memory location, marks it
as bad and unavailable, and passes the information to the OS or hypervisor. Depending on the state
of the memory pages associated with the identified memory region, the OS or hypervisor may take
those pages out of the available pool. If the memory pages are already committed to a virtual
machine (VM), the OS or hypervisor may choose to shut down the associated VM (guest OS) or raise a
machine check exception in the guest OS. The guest OS might in turn close the affected thread or
process, or it might shut itself down (Figure 1). You can restart the halted VM or process once you
replace the bad memory at the server's next maintenance cycle.
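
The decision the OS or hypervisor makes after receiving the bad address can be illustrated with a short Python sketch. This is a simplified model, not HP firmware or any real OS interface: the names `PageState`, `Action`, `MemoryErrorReport`, and `plan_recovery` are hypothetical, and the `guest_handles_mce` flag is an assumed stand-in for whatever policy a given hypervisor applies when choosing between shutting down a VM and forwarding the exception to it.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List


class PageState(Enum):
    FREE = auto()          # page is not allocated to anything
    IN_USE_BY_VM = auto()  # page is already committed to a guest VM


class Action(Enum):
    RETIRE_PAGE = auto()           # take the page out of the available pool
    SHUT_DOWN_VM = auto()          # the host stops the affected VM
    FORWARD_MCE_TO_GUEST = auto()  # the guest OS receives the exception and decides


@dataclass(frozen=True)
class MemoryErrorReport:
    page_number: int                 # hypothetical identifier for the bad page
    state: PageState                 # what the page was being used for
    guest_handles_mce: bool = False  # assumed policy input, not from the source


def plan_recovery(report: MemoryErrorReport) -> List[Action]:
    """Return the recovery actions for one reported uncorrectable error."""
    # The bad page is always withdrawn so nothing new is placed on it.
    actions = [Action.RETIRE_PAGE]
    if report.state is PageState.IN_USE_BY_VM:
        # A committed page needs an extra step for the VM that owns it.
        if report.guest_handles_mce:
            actions.append(Action.FORWARD_MCE_TO_GUEST)
        else:
            actions.append(Action.SHUT_DOWN_VM)
    return actions


free_page_plan = plan_recovery(MemoryErrorReport(7, PageState.FREE))
vm_page_plan = plan_recovery(MemoryErrorReport(8, PageState.IN_USE_BY_VM))
guest_plan = plan_recovery(
    MemoryErrorReport(9, PageState.IN_USE_BY_VM, guest_handles_mce=True)
)
```

In this model an error on a free page only retires the page, so the rest of the system keeps running, while an error on a committed page also affects the VM that owns it.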
Figure 1: The HP Memory Quarantine error recovery process

1. The Intel Xeon processor's MCA Recovery function detects an uncorrectable memory error, and a
   machine check exception is generated.
2. HP Memory Quarantine tags the memory location as bad and sends the address data to the OS or
   hypervisor.
3. The OS or hypervisor decides how to handle recovery. If necessary, the application or VM using
   the affected memory is shut down; otherwise, the system keeps running.
4. The OS or hypervisor blocks use of the bad memory location by any restarted or new
   application/VM.

Diagram label: DIMM (marked X)
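
The last step of Figure 1, blocking reuse of the bad location, can be sketched as a page pool that never hands out a quarantined page again. This is a toy model under stated assumptions: `PagePool` and its methods are hypothetical names, pages are plain integers, and a real OS or hypervisor tracks physical memory in far more detail.

```python
from typing import Dict, Optional, Set


class PagePool:
    """Toy allocator that keeps quarantined pages out of circulation."""

    def __init__(self, total_pages: int) -> None:
        self._free: Set[int] = set(range(total_pages))
        self._quarantined: Set[int] = set()
        self._owner: Dict[int, str] = {}  # page number -> current owner

    def allocate(self, owner: str) -> int:
        """Hand out the lowest-numbered healthy free page."""
        if not self._free:
            raise MemoryError("no healthy pages left")
        page = min(self._free)
        self._free.remove(page)
        self._owner[page] = owner
        return page

    def quarantine(self, page: int) -> Optional[str]:
        """Mark a page as bad; return the owner that was using it, if any."""
        self._quarantined.add(page)
        self._free.discard(page)
        return self._owner.pop(page, None)

    def release(self, page: int) -> None:
        """Return a page to the pool unless it has been quarantined."""
        self._owner.pop(page, None)
        if page not in self._quarantined:
            self._free.add(page)


pool = PagePool(total_pages=4)
vm_page = pool.allocate("vm-a")         # the VM is given a page
affected = pool.quarantine(vm_page)     # an error is reported on that page
pool.release(vm_page)                   # the halted VM gives the page back
restarted_page = pool.allocate("vm-a")  # the restarted VM gets a different page
```

Because `release` checks the quarantine set, the bad page stays out of the free pool even after the VM that held it is shut down and restarted.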