HP Superdome SX2000 User Service Guide, Seventh Edition - HP Integrity Superdo - Page 36

DRAM Erasure, PDC Functional Changes, Platform Dependent Hardware


DRAM Erasure
A common cause of a correctable memory error is a DRAM failure; the ability to correct this type
of memory failure in hardware is called chip kill. Address or control bit failure is another common
cause. Chip kill ECC schemes add hardware logic that enables them to detect and correct
more than a single-bit error when the hardware is programmed to do so. A common
implementation of traditional chip kill scatters the data bits from each DRAM component across
multiple ECC code words, so that only one bit from each DRAM is used per ECC code word.
Double chip kill is an extension of chip kill that enables the system to correct multiple
ECC errors in an ECC code word. Double chip kill is also known as DRAM erasure.
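The bit-scattering idea can be illustrated with a short sketch. The Python below is illustrative only: the device count, the 4-bit device width, and the function names are assumptions, not the sx2000 memory layout, and no ECC is implemented. It only shows that when each DRAM contributes one bit per code word, the failure of an entire DRAM damages at most one bit in each code word.

```python
# Hypothetical geometry (assumed for illustration, not taken from the sx2000 design).
NUM_DRAMS = 8        # DRAM devices contributing to one memory access
BITS_PER_DRAM = 4    # data bits each device supplies per access


def scatter(dram_outputs):
    """Build code words so that word w takes bit w from every DRAM.

    dram_outputs[d][b] is bit b supplied by DRAM d.
    Returns one code word per bit position, each with one bit per DRAM.
    """
    return [[dram_outputs[d][w] for d in range(len(dram_outputs))]
            for w in range(len(dram_outputs[0]))]


def bits_changed(good_words, bad_words):
    """Count the differing bits in each pair of code words."""
    return [sum(g != b for g, b in zip(gw, bw))
            for gw, bw in zip(good_words, bad_words)]


# All-zero data from healthy devices.
healthy = [[0] * BITS_PER_DRAM for _ in range(NUM_DRAMS)]

# DRAM 3 fails completely: every bit it supplies is inverted.
failed = [row[:] for row in healthy]
failed[3] = [1] * BITS_PER_DRAM

damage = bits_changed(scatter(healthy), scatter(failed))
print(damage)  # one damaged bit per code word
```

Because each code word sees only one bad bit, an ECC that corrects single-bit errors per code word can recover from the loss of the whole device in this simplified model.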
DRAM erasure is invoked when the number of correctable memory errors exceeds a threshold.
It can be invoked on a memory subsystem, bus, rank or bank. PDC tracks the errors on the
memory subsystem, bus, rank and bank in addition to the error information it tracks in the PDT.
PDC Functional Changes
There are three primary threads of control in the processor dependent code (PDC): the bootstrap,
the error code, and the PDC procedures. The bootstrap is the primary thread of control until
the OS is launched. The boot console handler (BCH) acts as a user interface for the bootstrap,
but can also be used to diagnose problems with the system. The BCH can call the PDC procedures,
but this explicit capability is available only in MFG mode through the Debug menu.
The PDC procedures are the primary thread of control once the OS launches. After the OS
launches, the PDC code is active only when the OS calls a PDC procedure or when an error
invokes the error code. Normally, the error thread of control returns control to the OS
through OS_HPMC, OS_TOC or RFI (LPMC or CMCI). In some cases, the HPMC or MCA handler
halts the cell or partition.
If a correctable memory error occurs during run time, the new chipset logs the error and corrects
it in memory (reactive scrubbing). Diagnostics periodically call PDC_PAT_MEM (Read Memory
Module State Info) to read the error logs. When this PDC call is made, system firmware updates
the PDT and deletes entries older than 24 hours in the structure that counts how many errors
have occurred for each memory subsystem, bus, rank or bank. When the counts exceed the
thresholds, PDC invokes DRAM erasure on the appropriate memory subsystem, bus, rank or
bank. Invoking DRAM erasure does not interrupt the operation of the OS.
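The 24-hour counting structure described above can be pictured as a sliding-window counter. The Python below is a minimal sketch under stated assumptions: the class name, the location key format, and the use of timestamps in seconds are all hypothetical, and the actual firmware data structure is not documented here. Only the 24-hour expiry comes from the text.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 24 * 60 * 60  # entries older than 24 hours are dropped (from the text)


class ErrorCounter:
    """Counts correctable errors per location key within a time window (hypothetical)."""

    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self.events = defaultdict(deque)  # key -> timestamps, oldest first

    def record(self, key, timestamp):
        """Log one correctable error for the given location."""
        self.events[key].append(timestamp)

    def expire(self, now):
        """Drop entries older than the window, as happens when the log is read."""
        for stamps in self.events.values():
            while stamps and now - stamps[0] > self.window:
                stamps.popleft()

    def count(self, key, now):
        """Return the number of errors for key that are still inside the window."""
        self.expire(now)
        return len(self.events[key])


counter = ErrorCounter()
bank = ("bus0", "rank1", "bank2")        # hypothetical location key
counter.record(bank, timestamp=0)        # will be older than 24 hours at read time
counter.record(bank, timestamp=10000)
counter.record(bank, timestamp=90000)
recent = counter.count(bank, now=90000)
print(recent)  # only the two entries within the last 24 hours remain
```

Expiring old entries at read time mirrors the description that the cleanup happens when the PDC call is made, rather than on a separate timer.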
When PDC invokes DRAM erasure, the information returned by PDC_PAT_MEM (Read Memory
Module State Info) indicates the scope of the invocation and provides information to enable
diagnostics to determine why it was invoked. PDC also sends IPMI events indicating that DRAM
erasure is in use. When PDC invokes DRAM erasure, the correctable errors that caused DRAM
erasure are removed from the PDT. Because invoking DRAM erasure increases the latency of
memory accesses and reduces the ability of ECC to detect multibit errors, you must notify the
customer that the memory subsystem must be serviced. HP recommends that the memory
subsystem be serviced within a month of invoking DRAM erasure on a customer machine.
The thresholds for invoking DRAM erasure are incremental, so that PDC invokes DRAM erasure
on the smallest part of the memory subsystem necessary to protect the system against another bit
error.
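One way to read "incremental thresholds" is that smaller scopes have lower limits and are checked first. The Python below sketches that interpretation only; the threshold values, the (bus, rank, bank) location tuples, and the function names are assumptions made for illustration, and the real PDC values and selection logic are not given in this guide.

```python
from collections import Counter

# Hypothetical thresholds, smallest scope first. Real values are not given in this guide.
THRESHOLDS = [("bank", 3), ("rank", 6), ("bus", 12), ("subsystem", 24)]


def scope_key(location, scope):
    """Truncate a (bus, rank, bank) location to the given scope."""
    bus, rank, bank = location
    return {
        "bank": (bus, rank, bank),
        "rank": (bus, rank),
        "bus": (bus,),
        "subsystem": (),
    }[scope]


def choose_erasure_scope(error_locations):
    """Return (scope, key) for the smallest scope whose count exceeds its limit, or None."""
    for scope, limit in THRESHOLDS:
        counts = Counter(scope_key(loc, scope) for loc in error_locations)
        for key, n in counts.most_common():
            if n > limit:
                return scope, key
    return None


# Four errors in one bank: the bank limit is exceeded first.
concentrated = choose_erasure_scope([(0, 1, 2)] * 4)

# Seven errors spread over seven banks of one rank: no bank qualifies, the rank does.
spread = choose_erasure_scope([(0, 1, b) for b in range(7)])

# A single error exceeds no limit.
quiet = choose_erasure_scope([(0, 0, 0)])

print(concentrated, spread, quiet)
```

Under these assumed limits, errors confined to one bank trigger erasure on that bank alone, while errors spread across a rank escalate to the rank, which matches the stated goal of acting on the smallest necessary part of the memory subsystem.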
Platform Dependent Hardware
The platform dependent hardware (PDH) includes functionality that is required by both system
and management firmware. The PDH provides the following features:
• An interface that passes multiple forms of information between system firmware and the
  MP on the SBC through the platform dependent hardware controller (PDHC, on the PDH
  daughter card).
• Flash EPROM for PDHC boot code storage.
• PDHC SRAM for operational instruction and data storage.