HP SureStore 7400 Linux Configuration HP VA 7100/7400 - Page 10

Linux Configuration

HP VA 7100/7400

Rev 2002-01-23

Page 10

The major distinguishing behavior of the new recovery function is that there are now very few
conditions in which a device will ever be marked offline.

More specifically, the new error handler function attempts to abort and retry any outstanding
commands that have failed or timed out. If this retry also fails, the function waits 15 seconds and
tries again. It continues aborting and retrying every 15 seconds for up to 5 minutes. At 5 minutes, it
attempts to reset the offending device (this reset function does not appear to be implemented in the
qlogicfc driver). If this reset fails, it attempts to reset the bus, and then finally the host adapter
itself. If none of these resets allows for successful retirement of outstanding SCSI commands, the new
error recovery function loops back and begins the entire recovery process from the top. This continues
indefinitely until the device returns. The code is very straightforward and can easily be customized to
eventually mark the device offline if desired, after any combination or duration of
aborts/retries/resets.
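
The escalation sequence described above can be sketched as a small simulation. This is not the
patch's kernel code: it is a Python model, and the names run_recovery, abort_and_retry, reset,
max_passes, and sleep are hypothetical. The 15-second interval, the 5-minute window, and the
device/bus/host reset order come from the text; max_passes stands in for the "eventually mark the
device offline" customization and defaults to the unbounded behavior.

```python
RETRY_INTERVAL_S = 15        # wait between abort/retry attempts (from the text)
RETRY_WINDOW_S = 5 * 60      # length of the abort/retry phase (from the text)
RESET_ORDER = ("device", "bus", "host")  # escalation order (from the text)


def run_recovery(abort_and_retry, reset, max_passes=None, sleep=None):
    """Model of the recovery loop; all parameter names are hypothetical.

    abort_and_retry() returns True once outstanding commands complete.
    reset(level) returns True if commands complete after that reset.
    max_passes=None loops forever, as the patch does; a number models a
    customized policy that eventually gives up.
    """
    sleep = sleep or (lambda seconds: None)  # no real waiting by default
    passes = 0
    while max_passes is None or passes < max_passes:
        waited = 0
        # Phase 1: abort and retry every 15 seconds for up to 5 minutes.
        while True:
            if abort_and_retry():
                return "recovered"
            if waited >= RETRY_WINDOW_S:
                break
            sleep(RETRY_INTERVAL_S)
            waited += RETRY_INTERVAL_S
        # Phase 2: escalate through device, bus, and host adapter resets.
        for level in RESET_ORDER:
            if reset(level):
                return "recovered"
        # Nothing worked: start over from the top.
        passes += 1
    return "offline"
```

With max_passes left at None and a device that never returns, the function never returns either,
which mirrors the indefinite looping described above.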

It should be noted that the qla2x00.c driver, which is available from Qlogic’s web page, attempts to
address this same issue with what its code refers to as a “SCSI kluge”. That solution involves
detecting when the fibre-channel loop is down and bumping the timer values on outstanding SCSI
commands to prevent them from timing out. The driver then performs its own retries from a separate
low-level command queue. The qla2x00.c driver does not currently take advantage of the new error
handling architecture and does not address the case where a device is timing out but is still present
on the fibre-channel loop. In such cases, a timeout can still result in the device being marked
offline, with data loss and/or filesystem corruption. The “SCSI kluge” provided with that driver
protects only against timeouts that occur because the link is down (device unplugged from the loop).
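
The gap described above can be illustrated with a toy model. This is not code from the qla2x00.c
driver: Command, check_timeout, and the 30-second extension are hypothetical, chosen only to show
why extending timers while the loop is down does not help a device that is silent while the loop
is up.

```python
from dataclasses import dataclass


@dataclass
class Command:
    deadline: float  # time (in seconds) at which the command times out


def check_timeout(cmd, now, loop_up, extension=30.0):
    """Toy model of a timer-bumping policy (hypothetical names and values)."""
    if not loop_up:
        # Loop down: push the deadline forward so the command cannot expire.
        cmd.deadline = max(cmd.deadline, now + extension)
        return "pending"
    # Loop up: the timer is left alone, so a silent device still times out.
    return "timed_out" if now >= cmd.deadline else "pending"
```

In this model a command checked repeatedly with loop_up=False stays pending indefinitely, while
the same command with loop_up=True and no response times out at its original deadline.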

Caveats:

The behavior of our patch to the qlogicfc.c driver is not yet ideal. It assumes that a missing or
timed-out device will eventually return to its original position. If the device does not return, or
returns at some other device location on the system, there is currently no means of manually
notifying the new error handler that it should give up and mark the missing device offline. While the
error recovery function is active, all new or existing processes with outstanding I/O to that Qlogic
HBA remain blocked in error recovery, waiting for the device to return, and they cannot be killed. We
have assumed that “missing devices that never return” will be rare and that our approach is
preferable to the potential for lost data or filesystem corruption for “missing devices that
eventually return”. In any case, this patch is offered as a safeguard and as an example that can
easily be customized to provide a wide range of recovery policies specifically suited to
Qlogic-attached devices.
