HP SureStore 7400 Linux Configuration HP VA 7100/7400 - Page 10



Linux Configuration
HP VA 7100/7400
Rev 2002-01-23

The major distinguishing behavior of the new recovery function is that there are now very few
conditions in which a device will ever be marked offline.

More specifically, the new error handler function will attempt to abort and retry any outstanding
commands that have failed or timed out. If this retry also fails, the function will wait 15 seconds and
try again. It continues aborting and retrying every 15 seconds for up to 5 minutes. At 5 minutes, it
attempts to reset the offending device (this reset function does not appear to be implemented in the
qlogicfc driver). If this reset fails, it attempts to reset the bus, and then finally the host adapter
itself. If none of these resets allows for successful retirement of outstanding SCSI commands, the new
error recovery function loops back and begins the entire recovery process from the top. This continues
indefinitely until the device returns. The code is straightforward and can be easily customized to
eventually mark the device offline if desired, after any combination or duration of aborts, retries,
and resets.
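
The escalation sequence described above can be sketched as a small, self-contained C model. This is
not the patch itself and does not use any kernel interfaces: the names eh_recover, eh_policy, eh_sim,
and the two callbacks are hypothetical, introduced only for illustration. The 15-second interval and
5-minute window come from the text; the max_cycles field is an assumed knob showing where a
"give up and mark offline" policy could be added.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Recovery steps, in the order the text describes them. */
enum eh_step { EH_ABORT_RETRY, EH_DEVICE_RESET, EH_BUS_RESET, EH_HOST_RESET };
enum eh_result { EH_RECOVERED, EH_OFFLINE };

/* Hypothetical callbacks: try one step (true = commands completed), and wait. */
typedef bool (*eh_try_fn)(enum eh_step step, void *ctx);
typedef void (*eh_sleep_fn)(unsigned seconds, void *ctx);

struct eh_policy {
    unsigned retry_interval_s; /* 15 in the text */
    unsigned retry_window_s;   /* 300 (5 minutes) in the text */
    unsigned max_cycles;       /* assumption: 0 = loop forever, as the patch does */
};

enum eh_result eh_recover(const struct eh_policy *p, eh_try_fn try_step,
                          eh_sleep_fn sleep_s, void *ctx)
{
    static const enum eh_step resets[] = {
        EH_DEVICE_RESET, EH_BUS_RESET, EH_HOST_RESET
    };

    assert(p != NULL && try_step != NULL && sleep_s != NULL);
    assert(p->retry_interval_s > 0);

    for (unsigned cycle = 0; p->max_cycles == 0 || cycle < p->max_cycles; cycle++) {
        /* Phase 1: abort and retry at a fixed interval until the window closes. */
        for (unsigned waited = 0; waited < p->retry_window_s;
             waited += p->retry_interval_s) {
            if (try_step(EH_ABORT_RETRY, ctx))
                return EH_RECOVERED;
            sleep_s(p->retry_interval_s, ctx);
        }
        /* Phase 2: escalate through device, bus, and host adapter resets. */
        for (size_t i = 0; i < sizeof resets / sizeof resets[0]; i++) {
            if (try_step(resets[i], ctx))
                return EH_RECOVERED;
        }
        /* Nothing worked: start the whole sequence again from the top. */
    }
    return EH_OFFLINE; /* reached only when max_cycles is non-zero */
}

/* Simulated device for experimenting: fails a fixed number of attempts. */
struct eh_sim {
    unsigned failures_left; /* attempts that fail before one succeeds */
    unsigned attempts;      /* total steps tried so far */
    unsigned slept_s;       /* total simulated wait time */
    enum eh_step last_step; /* most recent step attempted */
};

bool eh_sim_try(enum eh_step step, void *ctx)
{
    struct eh_sim *s = ctx;
    s->attempts++;
    s->last_step = step;
    if (s->failures_left == 0)
        return true;
    s->failures_left--;
    return false;
}

void eh_sim_sleep(unsigned seconds, void *ctx)
{
    struct eh_sim *s = ctx;
    s->slept_s += seconds; /* record the wait instead of really sleeping */
}
```

With a policy of { 15, 300, 0 }, a simulated device that fails its first 21 attempts is recovered by
the bus reset, after 20 abort/retry attempts and one failed device reset. Setting max_cycles to 1
instead makes the model return EH_OFFLINE after a single pass through all steps.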

It should be noted that the qla2x00.c driver, which is available from Qlogic’s web page, attempts to
address this same issue with what its code refers to as a “SCSI kluge”. That solution involves
detecting when the fibre-channel loop is down and bumping the timer values on outstanding SCSI
commands to prevent them from timing out. The driver then performs its own retries from a separate
low-level command queue. The qla2x00.c driver does not currently take advantage of the new error
handling architecture and does not address the case where a device is timing out but is still present on
the fibre-channel loop. In such cases, the timeout can still result in that device being marked offline,
with data loss and/or filesystem corruption. The “SCSI kluge” provided with that driver protects only
against timeouts that occur because the link is down (device unplugged from the loop).
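
The gap described above can be illustrated with a minimal model. This is not code from the qla2x00.c
driver; pending_cmd, kluge_tick, and cmd_timed_out are hypothetical names, and the sketch assumes
only what the text states: timers are pushed out when the loop is down and left alone otherwise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one outstanding SCSI command. */
struct pending_cmd {
    unsigned deadline_s; /* time at which the command would time out */
};

/*
 * Model of the timer bump: the deadline is pushed out only while the
 * loop is down. Returns true if the deadline was extended.
 */
bool kluge_tick(struct pending_cmd *cmd, bool loop_down,
                unsigned now_s, unsigned bump_s)
{
    assert(cmd != NULL);
    if (!loop_down)
        return false; /* device still on the loop: no protection applied */
    cmd->deadline_s = now_s + bump_s;
    return true;
}

/* True once the command's deadline has passed. */
bool cmd_timed_out(const struct pending_cmd *cmd, unsigned now_s)
{
    assert(cmd != NULL);
    return now_s >= cmd->deadline_s;
}
```

In this model, a command with a 30-second deadline survives past 30 seconds if the loop is reported
down at 25 seconds, because its deadline is moved out. The same command against a slow device that
is still on the loop gets no extension and times out at 30 seconds, which is the case the text
identifies as unprotected.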

Caveats:
The behavior of our patch to the qlogicfc.c driver is not yet ideal. It assumes that a
missing or timed-out device will eventually return to its original position. If the device does not
return, or returns at some other device location on the system, there is currently no means to
manually notify the new error handler that it should give up and mark the missing device offline.
While the error recovery function is active, all new or existing processes with outstanding I/O to
that Qlogic HBA remain blocked in error recovery waiting for the device to return, and they
cannot be killed. We have assumed that “missing devices that never return” will be rare,
and that our approach is preferable to the risk of lost data or filesystem corruption for “missing
devices that eventually return”. In any case, this patch is offered as a safeguard and as an example
that can be easily customized to provide a wide range of recovery policies specifically suited to
Qlogic-attached devices.