IBM 86655RY Hardware Maintenance Manual - Page 218

RAID-5 controller level protection, Catastrophic disk failure protection



208 Hardware Maintenance Manual: Netfinity 7600 - Type 8665 Models 1RY, 2RY
RAID-5 controller level protection:
At the controller level, RAID-5 has become an industry-standard method of providing increased availability for servers. RAID-5 and RAID-1 implementations allow servers to continue operating even if there is a "catastrophic" failure of a hard drive.
During normal operations in a RAID-5 environment, redundant information is calculated and written out to the drives. In an n-disk environment, n-1 disks of data space are provided, with one disk of space dedicated to redundant "check sum" or "parity" information. For example, three 2GB drives provide 4GB of data space and 2GB of redundancy.
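The capacity arithmetic above can be sketched in a few lines of Python. The function name and the three-drive minimum check are illustrative assumptions, not anything the manual defines; the sketch also assumes all drives are the same size.

```python
def raid5_capacity(drive_count, drive_size_gb):
    """Return (data_space_gb, redundancy_gb) for equal-size drives.

    Hypothetical helper: n drives give n-1 drives of data space and
    one drive's worth of space for check sum (parity) information.
    """
    if drive_count < 3:
        # Assumption: RAID-5 is conventionally built from three or more drives.
        raise ValueError("RAID-5 needs at least three drives")
    data_space = (drive_count - 1) * drive_size_gb
    redundancy = drive_size_gb
    return data_space, redundancy


# The manual's example: three 2GB drives.
print(raid5_capacity(3, 2))
```

With three 2GB drives this yields 4GB of data space and 2GB of redundancy, matching the example in the text.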
Note: The redundant data is actually spread out over all the disks for performance reasons.
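One way to picture the note is a layout in which the drive holding the check sum changes from stripe to stripe. The rotation rule below is only an assumption chosen for illustration; the manual does not say which placement ServeRAID controllers actually use, and the function names are hypothetical.

```python
def parity_drive(stripe, drive_count):
    """Index of the drive holding parity for a stripe.

    Hypothetical rotation: parity starts on the last drive and moves
    one drive to the left with each successive stripe.
    """
    return (drive_count - 1 - stripe) % drive_count


def layout(stripes, drive_count):
    """Return one row per stripe, marking data (D) and parity (P) blocks."""
    rows = []
    for stripe in range(stripes):
        p = parity_drive(stripe, drive_count)
        rows.append(["P" if d == p else "D" for d in range(drive_count)])
    return rows


# Three drives, three stripes: each drive holds parity for one stripe.
for row in layout(3, 3):
    print(" ".join(row))
```

Because no single drive holds all of the parity under this scheme, the parity writes are shared across the array rather than concentrated on one disk.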
Catastrophic disk failure protection:
If a drive that is a member of a RAID-5 array fails, the remaining members of the array can use their redundant information to recalculate the lost data, either to respond to user requests for data or to rebuild the data stored on the lost drive when it is replaced with a new one.
For example, information in Record 1 from Drive 1 is combined with the check sum information on Drive 3 to recreate information that is not available from Drive 2.
As long as the array controller can access the remaining n-1 drives, the rebuild will be successful. Naturally, if a second disk failure were to suddenly occur, the array and its data would be lost. RAID-5 can only protect against the loss of a single drive.
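The recalculation described above can be illustrated with a byte-wise XOR, which is the usual way RAID-5 parity is computed. The manual does not name the function the controller uses, so treat XOR here as an assumption; the record contents and the helper name are made up for the sketch.

```python
def xor_blocks(*blocks):
    """Byte-wise XOR of equal-length blocks (hypothetical helper)."""
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("blocks must be the same length")
    result = bytearray(length)
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


record_1 = b"DATA-ONE"  # Record 1 on Drive 1
record_2 = b"DATA-TWO"  # Record 2 on Drive 2
check_sum = xor_blocks(record_1, record_2)  # stored on Drive 3

# Drive 2 fails: combine Record 1 with the check sum to recreate Record 2.
rebuilt_record_2 = xor_blocks(record_1, check_sum)
print(rebuilt_record_2 == record_2)
```

The same call rebuilds any single missing block from the other n-1 blocks. With two blocks missing there is not enough information left, which is why a second drive failure loses the array.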
Grown sector media error protection:
In this scenario, as the drive attempts to read data for a read request, it determines that Record 1 of Disk 1 has a bad sector. If the media error is minor, the drive corrects or remaps the information using the drive ECC information, which is transparent to the RAID array.
If the drive cannot recreate the information from its ECC information, the data would be lost if there were no RAID support. In such a case, ServeRAID controllers can recognize the fault and re-create the data from redundant information stored on other drives.
For example:
1. Record 1 is corrected from data stored in Record 2 on Drive 2 and check sum information on Drive 3. The ServeRAID controller requests that Record 1 be rewritten.
2. The drive remaps the bad sector elsewhere on the drive. Record 1 now has good data.
In this example, RAID-5 has increased the availability of the information by re-creating data that otherwise would have been lost. The example assumes that the process was triggered by an access to this data on the drive. Had the data not been accessed, the error would not have been detected. This problem can be significant if a catastrophic failure occurs before the data is corrected.
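The two-step example above can be modeled with a toy array in Python. Everything here is a simplified assumption for illustration: the `Drive` class, the `MediaError` exception, and `read_with_repair` are hypothetical names, XOR stands in for the check sum calculation, and real ServeRAID firmware is not exposed this way.

```python
from functools import reduce


class MediaError(Exception):
    """Raised by the toy drive when a sector cannot be read (hypothetical)."""


class Drive:
    def __init__(self, blocks):
        self.blocks = dict(blocks)  # stripe number -> bytes
        self.bad = set()            # stripes with an unreadable sector
        self.remapped = set()       # stripes the drive has remapped

    def read(self, stripe):
        if stripe in self.bad:
            raise MediaError(stripe)
        return self.blocks[stripe]

    def write(self, stripe, data):
        # A rewrite lets the drive remap the bad sector elsewhere.
        if stripe in self.bad:
            self.bad.discard(stripe)
            self.remapped.add(stripe)
        self.blocks[stripe] = data


def xor_blocks(*blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def read_with_repair(drives, index, stripe):
    """Read a block; on a media error, re-create and rewrite it."""
    try:
        return drives[index].read(stripe)
    except MediaError:
        # Step 1: re-create the block from the other drives, then rewrite it.
        peers = [d.read(stripe) for i, d in enumerate(drives) if i != index]
        data = xor_blocks(*peers)
        # Step 2: the rewrite causes the drive to remap the bad sector.
        drives[index].write(stripe, data)
        return data


record_1, record_2 = b"DATA-ONE", b"DATA-TWO"
drives = [
    Drive({0: record_1}),                        # Drive 1
    Drive({0: record_2}),                        # Drive 2
    Drive({0: xor_blocks(record_1, record_2)}),  # Drive 3 (check sum)
]
drives[0].bad.add(0)  # Record 1 develops a bad sector

print(read_with_repair(drives, 0, 0) == record_1)
```

Note that the repair only runs when something reads the block. A bad sector in data that is never read stays latent, which is the weakness the text points out.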
Combination failure protection:
In this example, an undetected sector media error exists within Record 1 of Disk 1.
This error occurred within an archived section of the user's database that is seldom accessed. Before this error is recognized and corrected, a "catastrophic" failure of Drive 2 is sustained. So far, no data problems are noticed.
User requests for information other than Record 1 can still be serviced with RAID protection and data recalculation. When Drive 2 is replaced and a rebuild is initiated, the ServeRAID controller attempts to recalculate Record 2 from the failed Drive 2 by