HP ProLiant DL288 ISS Technology Update, Volume 7, Number 7 - Page 4



How much customer input goes into the design of your products?
I value and seriously consider customer input and feedback, and I try to directly implement them into HP products. After all, it’s
a solution set that we have to deliver to our customers.
What must HP do to remain the leader in industry-standard servers?
Innovation is at the top of my list since product quality, reliability, and the right feature set are a given.
U.S. Patents (filed):
• #7327612: Method and apparatus for providing the proper voltage to memory. Nguyen, V., Bacchus, R.; 02/05/08
• #7299331: Enhanced CPU RASUM feature in ISS server. Depew, K., Nguyen, V., Heinrich, D., Engler, D.; 11/20/07
• #7246190: Method and Apparatus for providing a bus in a computer system. Nguyen, V., Venugopal, R.; 07/17/07
• #7234081: Intelligent software controlled test DIMM. Nguyen, V., Vi Lu, Krontz, J.; 11/20/07
Soft memory errors caused by natural phenomena
All dynamic random access memory (DRAM) chips are inherently susceptible to soft memory errors. A DRAM chip is an
integrated circuit that contains millions of memory cells. Each memory cell contains an extremely small transistor and capacitor
that can store an electrical charge. When a memory cell is accessed during a read operation, an electrically charged capacitor
represents a "1" data bit. An uncharged capacitor represents a "0" data bit.
The charge (number of electrons) stored by a capacitor is proportional to its size, or storage area. The capacitor’s charge is
also directly proportional to the memory device’s operating voltage. Over the years, improvements in DRAM storage density
have resulted in smaller memory cells and lower operating voltages, from 5 volts originally to 1.8 volts today. The lower
operating voltage allows memory to run faster and consume less power. However, reducing the voltage also makes the logical
function of the memory more susceptible to errors. In addition, as memory cells become smaller, the storage area and charge
of a capacitor that represents a “1” data bit shrink proportionately. The reduced charge makes the memory cell more sensitive
to natural and man-made phenomena that can inadvertently change the charge and, therefore, the data stored in a memory
cell. If the value of a data bit changes and is not corrected, the error can cause an application to crash or to pass on
bad data.
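The proportionality described above is the capacitor relation Q = C × V. The short Python sketch below shows how the number of stored electrons scales when the operating voltage drops from 5 volts to 1.8 volts. The 30 femtofarad cell capacitance is an assumed, order-of-magnitude value chosen only for illustration and does not come from this article. Holding it constant isolates the effect of voltage alone.

```python
# Illustrative only: the cell capacitance below is an assumed value,
# not a figure taken from this article.
ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs per electron


def stored_electrons(capacitance_farads: float, voltage: float) -> float:
    """Electrons held by a capacitor charged to `voltage` (Q = C * V)."""
    return capacitance_farads * voltage / ELEMENTARY_CHARGE


ASSUMED_CELL_CAPACITANCE = 30e-15  # 30 femtofarads (assumption)

electrons_5v = stored_electrons(ASSUMED_CELL_CAPACITANCE, 5.0)
electrons_1v8 = stored_electrons(ASSUMED_CELL_CAPACITANCE, 1.8)

print(f"5.0 V: {electrons_5v:,.0f} electrons")
print(f"1.8 V: {electrons_1v8:,.0f} electrons")
print(f"ratio: {electrons_1v8 / electrons_5v:.2f}")
```

Whatever capacitance is assumed, the ratio depends only on the two voltages: a cell at 1.8 volts holds 36 percent of the charge it would hold at 5 volts. A smaller cell reduces the charge further.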
Memory errors generally fall into two categories: hard errors
and soft errors. A hard error consistently returns incorrect
results, indicating a silicon defect. For example, a memory cell
may be stuck so that it always returns a “0” bit, even when a
“1” bit is written to it. Soft errors, by contrast, are more
prevalent than hard errors. They occur randomly when an
electrical disturbance near a memory cell alters the charge on
the capacitor. A soft error does not indicate a problem with a
memory device because once the data is corrected, the same
error does not recur. Today's servers with large memory
capacities typically experience several soft memory errors
annually. Scientific research has revealed that the cause of
most soft errors is literally out of this world (see “Cosmic
particles” sidebar).
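The distinction between hard and soft errors can be sketched as a rewrite-and-reread check: if correcting the data makes the error go away, it was soft, and if the wrong value persists, it is hard. The Python below is a toy simulation under that assumption. `SimulatedCell` and `classify_error` are hypothetical names invented for this illustration and do not describe how any HP server or memory controller performs the check.

```python
class SimulatedCell:
    """Hypothetical one-bit memory cell used only for this illustration."""

    def __init__(self, stuck_at=None):
        self.stuck_at = stuck_at  # None = healthy, 0 or 1 = hard defect
        self.value = 0

    def write(self, bit):
        self.value = bit

    def read(self):
        # A stuck cell ignores what was written and always returns one value.
        return self.value if self.stuck_at is None else self.stuck_at

    def disturb(self):
        """Model a transient upset by flipping the stored bit once."""
        self.value ^= 1


def classify_error(cell, expected_bit):
    """Rewrite the expected data, then read it back to classify the error."""
    if cell.read() == expected_bit:
        return "no error"
    cell.write(expected_bit)       # correct the data
    if cell.read() == expected_bit:
        return "soft error"        # the correction held, so the cell is healthy
    return "hard error"            # the wrong value persists after rewriting


healthy = SimulatedCell()
healthy.write(1)
healthy.disturb()                  # transient flip: 1 becomes 0
print(classify_error(healthy, 1))

defective = SimulatedCell(stuck_at=0)
defective.write(1)
print(classify_error(defective, 1))
```

The first cell is reported as a soft error and the second as a hard error. A single test pattern cannot expose every defect (a cell stuck at “0” looks healthy while a “0” is expected), so this sketch only illustrates the definition.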
The collision of cosmic particles with a memory chip can
generate hundreds of trillions of electrons and can “flip” the value of a bit in a nearby memory cell, resulting in a single-bit
error. As memory cell density increases, more memory cells can be affected by a single particle collision. In addition, as voltage
is lowered, the logical operation of the memory can be affected. Error checking and correcting (ECC) algorithms can detect
and correct single-bit errors; however, some particle collisions can cascade to affect multiple bits. Fortunately, HP servers (with
the exception of some single-processor servers) feature Advanced ECC, which can reliably correct multi-bit errors when they
occur within a single DRAM chip.
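To make the single-bit versus multi-bit distinction concrete, the Python sketch below uses a textbook Hamming(7,4) code. This is an assumption chosen for brevity: it is not the code used on server memory modules and it is not HP's Advanced ECC, whose algorithm this article does not describe. The function names are hypothetical.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    # Codeword positions 1..7 hold: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position); position 0 means no error seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # parity over positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # parity over positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # parity over positions 4, 5, 6, 7
    position = s1 + 2 * s2 + 4 * s3  # syndrome names the suspect position
    if position:
        c[position - 1] ^= 1         # flip the bit the syndrome points at
    return [c[2], c[4], c[5], c[6]], position


data = [1, 0, 1, 1]
codeword = hamming74_encode(data)

single = list(codeword)
single[4] ^= 1                       # one flipped bit (position 5)
recovered, position = hamming74_decode(single)
print(recovered == data, position)   # corrected

double = list(codeword)
double[4] ^= 1                       # two flipped bits (positions 5 and 7)
double[6] ^= 1
recovered, position = hamming74_decode(double)
print(recovered == data, position)   # miscorrected: wrong bit is "fixed"
```

With one flipped bit the decoder restores the original data. With two flipped bits the syndrome points at an unrelated position and the decoder returns wrong data. This is why a multi-bit upset needs a stronger scheme than single-bit correction.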
Cosmic particles
Highly energized cosmic particles (neutrons or
protons) that originate from solar and extra-solar
sources can have enough energy to penetrate
Earth's atmosphere. When cosmic particles collide
with air molecules, the collisions produce multiple
lower energy particles that result in cascade
collisions or “showers.” These particles penetrate
most matter. They can randomly collide with an
integrated circuit (IC) and fragment silicon nuclei,
producing alpha particles and other secondary
particles that can travel in all directions. If one of
these particles travels near a p-n junction in the chip,
it can cause a signal to change voltage and,
consequently, a data bit to change value.