IBM 86655RY Hardware Maintenance Manual - Page 215

How They Occur, Drive protection features, Combination failures, Remapping bad sectors



Installing and configuring ServeRAID controllers
205
The act of synchronization executes data scrubbing.
Data scrubbing can be performed in the background while allowing
concurrent user disk activity on RAID-5 and RAID-1 logical drives.
Combination failures: How They Occur
A combination failure occurs when a drive fails catastrophically while undetected,
uncorrected sector media errors still exist on the remaining drives in the array.
Because a rebuild must read those sectors, the controller cannot rebuild all the data.
In such cases, a double failure exists; files must be restored from backup media.
See "Combination failure protection" on page 208 for information about protection
provided at the ServeRAID controller level.
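The double-failure condition can be illustrated with a minimal sketch of XOR parity, the general technique behind RAID-5. This is not ServeRAID firmware logic; the function names `xor_blocks` and `rebuild_block` and the two-byte blocks are hypothetical, chosen only to show why a latent sector error on a surviving drive prevents a rebuild.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_block(stripe, failed):
    """Rebuild the block of drive `failed` from the other blocks of one stripe.

    `stripe` maps drive index -> block bytes, or None where the sector
    cannot be read. Returns the rebuilt bytes, or None when a surviving
    block is unreadable too (the combination-failure case).
    """
    survivors = [blk for drive, blk in stripe.items() if drive != failed]
    if any(blk is None for blk in survivors):
        return None  # nothing left to reconstruct this stripe from
    return xor_blocks(survivors)


data0, data1 = b"\x11\x22", b"\x33\x44"
parity = xor_blocks([data0, data1])

# Drive 1 fails; every other sector in the stripe is readable.
ok = rebuild_block({0: data0, 1: None, 2: parity}, failed=1)

# Drive 1 fails while drive 0 has an undetected media error in this stripe.
lost = rebuild_block({0: None, 1: None, 2: parity}, failed=1)
```

In the first call the missing block is recovered from the data and parity blocks that remain. In the second call the stripe has lost two blocks, so the only recourse is the backup media.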
Note:
IBM provides management software, Netfinity Director, with IBM servers that
ship with ServerGuide. The software monitors the status of the hardware and
provides alerts when conditions are not optimal. Netfinity Director enables
customers to obtain all of the information necessary for data protection.
Installation of Netfinity Director or similar tools to monitor and track disk
subsystem integrity is critical for the protection of stored data.
Drive protection features:
Note:
This section explains the drive protection features in greater detail. You may wish to
skip this section and proceed to the procedures for synchronization and data scrubbing
in the next section.
The following sections describe the drive protection features of the ServeRAID
controllers:
• "Remapping bad sectors"
• "Error Correction Code (ECC)"
• "Predictive Failure Analysis (PFA)"
Remapping bad sectors:
Sector media errors that show up over time usually affect only a single 512-byte block
of data on the disk. This sector can be marked as "bad"; the location can then be
reassigned, or "remapped," to a spare sector of the drive.
Most drives reserve one spare sector per track of data and can perform this operation
automatically.
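The remapping idea can be sketched as a lookup table that redirects a bad sector to a spare one. This is a toy model, not drive firmware: the `Drive` class, its method names, and the single shared spare pool are assumptions made for illustration (the text above describes spares reserved per track).

```python
class Drive:
    """Toy model of a drive with a spare-sector pool and a remap table."""

    def __init__(self, spares):
        self.spares = list(spares)  # physical locations held in reserve
        self.remap = {}             # bad sector -> spare location

    def mark_bad(self, sector):
        """Mark `sector` bad and reassign it to a spare, if one is left."""
        if sector in self.remap:
            return self.remap[sector]  # already remapped earlier
        if not self.spares:
            raise RuntimeError("no spare sectors left")
        self.remap[sector] = self.spares.pop(0)
        return self.remap[sector]

    def resolve(self, sector):
        """Return the physical location that actually serves `sector`."""
        return self.remap.get(sector, sector)


drive = Drive(spares=[1000, 1001])
drive.mark_bad(42)
remapped = drive.resolve(42)   # served from the first spare
untouched = drive.resolve(43)  # healthy sectors are not redirected
```

Requests for the bad sector are served from the spare location from then on, while all other sectors keep their original location.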
Error Correction Code (ECC):
The drive avoids potential problems by using only "reliable" sections of the disk when
remapping bad sectors.
If a media problem develops after the data has been written, most drives can correct
minor sector media errors automatically during a disk read: the drive uses the error
correction code (ECC) information stored along with the data to recover the data, and
then rewrites it on the disk. If the sector is badly damaged and the data cannot be
reliably rewritten to the same spot, the drive remaps the data to a spare sector on the
disk. If the sector is very badly damaged, the drive may not be able to recreate the
data automatically with the ECC. If no other protection (such as RAID) is in place, the
system reports a read failure and the data is lost. These lost data areas are typically
reported to the user via operating system messages.
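The difference between a correctable and an uncorrectable error can be illustrated with a toy code. The sketch below uses an extended Hamming (8,4) code, which corrects any single flipped bit and detects, but cannot correct, two flipped bits. This choice is an assumption made for brevity: the manual does not name the code the drives use, and real drives apply much stronger codes over whole sectors. The function names `encode` and `decode` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit codeword (a list of 0/1 values)."""
    code = [0] * 8  # index 0 = overall parity, 1..7 = Hamming positions
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    for p in (1, 2, 4):  # each check bit covers positions whose index has bit p set
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2  # makes the parity of the whole word even
    return code


def decode(code):
    """Return (nibble, status); nibble is None when the data is unrecoverable."""
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i  # for a single error this equals the bad position
    odd_parity = sum(code) % 2
    if syndrome == 0 and not odd_parity:
        status = "clean"
    elif odd_parity:
        code = list(code)
        code[syndrome] ^= 1  # flip the single bad bit back
        status = "corrected"
    else:
        return None, "uncorrectable"  # two bits are wrong
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return nibble, status


stored = encode(0b1011)

minor = list(stored)
minor[5] ^= 1                 # one damaged bit
minor_result = decode(minor)

severe = list(stored)
severe[5] ^= 1
severe[6] ^= 1                # two damaged bits
severe_result = decode(severe)
```

The single-bit case corresponds to a minor media error that the drive repairs on its own. The two-bit case corresponds to a sector too damaged for the ECC, where only another layer of protection, such as RAID, can recover the data.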
Predictive Failure Analysis (PFA):
Note:
Replace all PFA drives as soon as possible.
As with any electrical/mechanical device, there are two basic failure types:
1. Gradual performance degradation of components can create a catastrophic drive
   failure (see "Catastrophic drive failures" on page 204).
Predictive Failure Analysis performs the following remedial operations:
• Monitors performance of drives