Dell PowerEdge T140 - Dell EMC PowerEdge Servers Troubleshooting Guide - Page 98

Slicing, RAID puncture, Causes of RAID puncture

Slicing
Configuring multiple RAID arrays across the same set of disks is called slicing.
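
The following minimal sketch is not part of the guide; it models slicing in Python with hypothetical virtual disk names (VD0, VD1) to show the layout the definition above describes: two arrays carved out of the same set of physical drives.

    from dataclasses import dataclass

    @dataclass
    class VirtualDisk:
        """A RAID array (slice) carved out of a set of physical disks."""
        name: str
        raid_level: int
        size_gb: int
        member_slots: tuple  # slot IDs of the physical drives

    # Hypothetical sliced configuration: both virtual disks are built on
    # the same three physical drives (slots 0, 1, and 2).
    vd0 = VirtualDisk("VD0", raid_level=5, size_gb=200, member_slots=(0, 1, 2))
    vd1 = VirtualDisk("VD1", raid_level=5, size_gb=400, member_slots=(0, 1, 2))

    shared = set(vd0.member_slots) & set(vd1.member_slots)
    print(f"Sliced configuration: VD0 and VD1 share physical drives {sorted(shared)}")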
RAID puncture
A RAID puncture is a feature of the Dell PowerEdge RAID Controller (PERC) designed to allow the controller to restore the redundancy of the array despite the loss of data caused by a double fault condition. Another name for a RAID puncture is rebuild with errors. When the RAID controller detects a double fault and there is insufficient redundancy to recover the data in the impacted stripe, the controller creates a puncture in that stripe and enables the rebuild to continue.
• Any condition that causes data to be inaccessible in the same stripe on more than one drive is a double fault (see the sketch after this list).
• Double faults cause the loss of all data within the impacted stripe.
• All RAID punctures are double faults, but not all double faults are RAID punctures.
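
As an illustration only (not taken from the guide), the following Python sketch applies the definition above: a stripe is a double fault when its data is inaccessible on more than one member drive. The per-stripe health map is hypothetical.

    # Hypothetical per-stripe map: for each stripe index, the set of member
    # drives whose data in that stripe is unreadable (bad block, or a failed
    # or missing drive).
    unreadable = {
        0: set(),    # stripe 0: fully readable
        1: {1},      # stripe 1: single fault; parity can still rebuild it
        2: {1, 2},   # stripe 2: unreadable on two drives; double fault
    }

    def is_double_fault(stripe_index: int) -> bool:
        """Data is inaccessible on more than one drive in the same stripe."""
        return len(unreadable.get(stripe_index, set())) > 1

    for stripe in sorted(unreadable):
        state = "double fault (all data in the stripe is lost)" if is_double_fault(stripe) else "recoverable"
        print(f"stripe {stripe}: {state}")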
Causes of RAID puncture
Without the RAID puncture feature, the array rebuild would fail and leave the array in a degraded state. In some cases, the failures may cause additional drives to fail and leave the array in a non-functioning, offline state. Puncturing an array has no impact on the ability to boot to, or access any data on, the array.
RAID punctures can occur in one of two situations (a sketch of both follows this list):
• Double fault already exists (data already lost). A data error on an online drive is propagated (copied) to a rebuilding drive.
• Double fault does not exist (data is lost when the second error occurs). While the array is in a degraded state, if a bad block occurs on an online drive, that LBA is RAID punctured.
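
The sketch below is illustrative only (it is not PERC firmware logic): a degraded three-drive RAID 5 rebuild is walked stripe by stripe, and any stripe whose surviving data is itself unreadable is marked punctured so that the rebuild can continue instead of failing.

    # Hypothetical stripe records for a degraded 3-drive RAID 5 virtual disk:
    # drive 2 has failed and is being rebuilt from drives 0 and 1.
    stripes = [
        {"index": 0, "bad_blocks_on_online_drives": 0},  # clean stripe
        {"index": 1, "bad_blocks_on_online_drives": 1},  # error on a surviving drive
        {"index": 2, "bad_blocks_on_online_drives": 0},
    ]

    punctured = []
    for stripe in stripes:
        if stripe["bad_blocks_on_online_drives"] > 0:
            # The data needed to reconstruct this stripe is itself unreadable,
            # so redundancy cannot be restored here: puncture the stripe and
            # keep rebuilding rather than failing the whole rebuild.
            punctured.append(stripe["index"])
        # Otherwise the stripe is rebuilt on the replacement drive from the
        # surviving data and parity (reconstruction itself is not modeled).

    print(f"Rebuild finished with {len(punctured)} punctured stripe(s): {punctured}")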
The advantage of puncturing an array is that it keeps the system available in production until the redundancy of the array is restored. The data in the affected stripe is lost whether the RAID puncture occurs or not. The primary disadvantage of this method is that, while the array has a RAID puncture in it, uncorrectable errors will continue to be encountered whenever the impacted data (if any) is accessed.
A RAID puncture can occur in the following three locations:
• In blank space that contains no data. That stripe will be inaccessible, but since there is no data in that location, the puncture has no significant impact. Any attempt by the operating system to write to a RAID punctured stripe fails, and the data is written to a different location.
• In a stripe that contains data that is not critical, such as a README.TXT file. If the impacted data is not accessed, no errors are generated during normal I/O. However, attempts to perform a file system backup fail to back up any files impacted by the RAID puncture, and performing a Check Consistency or Patrol Read operation generates Sense code 3/11/00 for the applicable LBAs and stripes (see the sketch after this list).
• In data space that is accessed. In this case, the lost data can cause a variety of errors. The errors can be minor errors that do not adversely impact a production environment, or they can be more severe and prevent the system from booting to an operating system or cause applications to fail.
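
For reference only (this decoding follows the standard SCSI sense-data convention rather than text from the guide), the sketch below interprets the Sense code 3/11/00 that Check Consistency or Patrol Read logs against a punctured LBA: sense key 3h is Medium Error, and ASC/ASCQ 11h/00h is Unrecovered Read Error.

    # Minimal decoder for the "Sense code: 3/11/00" entries logged against
    # punctured LBAs. Values follow the standard SCSI sense-data convention.
    SENSE_KEYS = {0x3: "MEDIUM ERROR"}
    ASC_ASCQ = {(0x11, 0x00): "UNRECOVERED READ ERROR"}

    def decode_sense(key: int, asc: int, ascq: int) -> str:
        key_name = SENSE_KEYS.get(key, f"sense key {key:#x}")
        detail = ASC_ASCQ.get((asc, ascq), f"ASC/ASCQ {asc:#04x}/{ascq:#04x}")
        return f"{key_name}: {detail}"

    # Sense code 3/11/00 -- the signature of an unrecoverable (punctured) LBA.
    print(decode_sense(0x3, 0x11, 0x00))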
An array that is RAID punctured will eventually have to be deleted and recreated to eliminate the RAID puncture. This procedure causes all data to be erased; the data must then be recreated or restored from backup after the RAID puncture is eliminated. The resolution for a RAID puncture can therefore be scheduled for a time that is more advantageous to the needs of the business.
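
As a hedged outline only (the guide does not prescribe specific commands, and these step descriptions are placeholders rather than controller operations), the sequence above can be expressed as an ordered checklist:

    # Illustrative outline of the only permanent fix for a punctured array.
    RESOLUTION_STEPS = [
        "Back up or otherwise preserve all recoverable data on the array",
        "Schedule a maintenance window that suits the needs of the business",
        "Delete the punctured virtual disk",
        "Recreate the virtual disk on the same (or replaced) physical disks",
        "Restore or recreate the data from backup",
        "Verify the array (for example, with a Check Consistency) before returning it to production",
    ]

    for number, step in enumerate(RESOLUTION_STEPS, start=1):
        print(f"{number}. {step}")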
If the data within a RAID punctured stripe is accessed, errors will continue to be reported against the affected bad LBAs with no possible correction available. Eventually (this could take minutes, days, weeks, or months), the Bad Block Management (BBM) table fills up, causing one or more drives to be flagged as predictive failure. As seen in the figure, drive 0 is typically the drive that gets flagged as predictive failure, because the errors on drive 1 and drive 2 are propagated to it. Drive 0 may actually be working normally; replacing drive 0 only causes the replacement drive to eventually be flagged as predictive failure as well.
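
The final sketch is an illustration of the behavior described above, not actual PERC or BBM internals, and it uses an assumed table capacity. It shows why drive 0 ends up flagged: every access to a punctured LBA charges another entry to drive 0's BBM table, and a replacement drive would fill its table the same way because the punctured stripes remain.

    # Hypothetical model of a per-drive Bad Block Management (BBM) table.
    BBM_CAPACITY = 256          # assumed table size, for illustration only
    bbm_entries = {0: 0, 1: 0, 2: 0}
    predictive_failure = set()

    def record_bad_lba(drive: int) -> None:
        """Log one bad-block entry and flag the drive when its table is full."""
        bbm_entries[drive] += 1
        if bbm_entries[drive] >= BBM_CAPACITY:
            predictive_failure.add(drive)

    # Errors from drives 1 and 2 were propagated to drive 0 during the rebuild,
    # so repeated reads of the punctured stripes keep charging entries to drive 0.
    for _ in range(300):
        record_bad_lba(drive=0)

    print(f"Drives flagged as predictive failure: {sorted(predictive_failure)}")
    # Replacing drive 0 does not help: the punctured stripes remain, so the
    # replacement drive's BBM table fills up in the same way.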