New SCSI eh code

 

  1. Introduction
  2. What was wrong with old code
  3. How the new code corrects this
  4. How to convert a driver
  5. Other Cleanups
  6. Future directions
  7. Other problem areas

Introduction

It seems that very few low-level driver maintainers have made the effort to convert over and use the new error handling code. I would like to take this opportunity to explain the many advantages of switching over.

As you all know, the old error handling code is quite imperfect. I see all kinds of bug reports where people get an infinite stream of SCSI resets, and the system is more or less hopelessly wedged. This is a common symptom, and given the way the old code worked, wouldn't be easy to fix.

The point of this message is to try and sell the concept, and encourage driver maintainers to make the effort to switch over. It isn't that hard, and the benefits are definitely worthwhile. A second objective of this note is to explain in detail what a driver author would need to do.

Finally, I should point out that this stuff isn't set in stone. If anyone has objections/comments, please send them on. In particular, some people have expressed an interest in doing their own timeout handling (i.e. a boolean in the host template which would tell the midlevel not to start a timer before the queuecommand interface is called). I don't want to make the change unless someone seriously intends to use it, as I would like to choose an implementation that suits the needs of the people who would want to use it.

A point about terminology - I still tend to refer to mid-layer, upper-layer, and lower-layer, mainly out of force of habit. With the new queuing code, the upper level drivers (disk, cdrom, etc) aren't really so much a layer any more, but I haven't decided what to call it instead. Similarly, the mid-level isn't really so much a level, but represents the scsi core, I suppose.

What was wrong with the old code

Let me start by summing up some of the negative features of the old error handling code:
The old error handling code is an imperfect state machine which always runs in the context of the interrupt handler. This limits the flexibility to a considerable degree, and in addition the code is extremely complicated and difficult to maintain.
When you are using the old code, error recovery takes place at the same time as new commands being queued on the bus. One frequent outcome is that error recovery actually makes the situation worse by screwing up commands to devices that are functioning normally.
The old code uses timers while the abort and reset functions are running in the low-level driver, and this is done in an attempt to assure that the operations complete in a timely fashion. Timeouts for these operations are not handled well, and there is no notification to the low-level driver that this has taken place. Thus there is a considerable amount of confusion as to exactly who has control over the system.
The old code was "designed" long ago before SMP safety became an issue. Currently we just hold a lock for the entire time that error recovery is active, and while this may work, it is less than optimal.

How the new code corrects this

Architecturally there are a couple of really interesting points about the new error handling code.
The first major difference is that when an error is detected which requires error handling, instead of trying to handle it immediately, a flag is set for the host in question, and this flag prevents any further commands from being queued to the host. The point of this is that the system will be allowed to settle until all outstanding commands to the host in question have either completed, failed or timed out.
Once this quiescent state is reached, the error handler thread is woken up. All error handling will take place in the context of this thread. Once error handling is complete, the flag for the host is cleared (meaning that new commands are allowed to once again be queued), and the error handler thread will go back to sleep.
When you are using the new error handling code, the normal command completion takes place from a bottom-half handler. Thus when you call the SCpnt->scsi_done() interface, the command will merely be inserted into a (hopefully) short queue that will get run the next time the bottom half handler runs (which should be as soon as your interrupt service routine returns). You do not need to take any special action to take advantage of this feature - you get it for free when you use the new error handling code.
Once a command has timed out, then attempts by the low-level driver to report status are ignored. There used to be a race condition here, which I fixed by looking at the return code of delete_timer(). If the timer has already fired, then it is too late to attempt completion, as the error handler thread may also be active. Hmm, this isn't quite right - I need to fix this so that we set a flag which says that the answer came in, but came in late.
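
To make the settle-then-wake behavior concrete, here is a minimal user-space sketch of the bookkeeping. It is a model under assumptions, not the mid-level code itself: struct sketch_host, its fields, and every sketch_* function are hypothetical names invented for this illustration, and the boolean eh_awake merely stands in for waking the error handler thread.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-host bookkeeping; not the real Scsi_Host. */
struct sketch_host {
    int  busy;         /* commands handed to the adapter and not yet retired */
    int  failed;       /* of those, how many failed or timed out */
    bool in_recovery;  /* the "flag": while set, nothing new may be queued */
    bool eh_awake;     /* stands in for waking the error handler thread */
};

/* Wake the handler only when every outstanding command is a failed one. */
static void sketch_maybe_wake(struct sketch_host *h)
{
    if (h->in_recovery && h->busy == h->failed)
        h->eh_awake = true;
}

/* Try to queue a new command; refused while recovery is pending. */
static bool sketch_issue(struct sketch_host *h)
{
    if (h->in_recovery)
        return false;
    h->busy++;
    return true;
}

/* A command completed normally. */
static void sketch_complete_ok(struct sketch_host *h)
{
    h->busy--;
    sketch_maybe_wake(h);
}

/* A command failed or timed out: block the host and let it settle. */
static void sketch_complete_error(struct sketch_host *h)
{
    h->failed++;
    h->in_recovery = true;
    sketch_maybe_wake(h);
}

/* Called by the handler when it is done: retire failures, reopen the host. */
static void sketch_recovery_done(struct sketch_host *h)
{
    h->busy -= h->failed;
    h->failed = 0;
    h->in_recovery = false;
    h->eh_awake = false;
}
```

With two commands outstanding, one error sets the flag but does not wake the handler; the handler is woken only when the second command completes, and new commands are accepted again after sketch_recovery_done().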

How to convert a driver

Converting a driver to use the new error handling code is fairly easy, and consists of several steps. These are summarized here:

  1. Ensure that sense data is automatically requested for CHECK_CONDITION.
  2. Define the new error handling entrypoints.
  3. Fix the queuecommand entrypoint to return a meaningful value.
  4. Set the use_new_eh_code flag in the host template.
Each of these steps is described in detail below. In addition to these steps, please also see the cleanups section of this document as well as the future directions section.

Ensure that sense data is automatically requested on a CHECK_CONDITION

The idea here is that sometimes when we look at the results of a command execution, the command might have returned a CHECK_CONDITION, meaning that something went wrong. The sense data has details about what went wrong. If the driver doesn't get this automatically, then the error handler is forced to wake up and ask for it explicitly. While it is not a strict requirement that you get the sense data, it tends to streamline things quite a bit, as the error handler thread is only involved when there is something that really needs attention.
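
A rough sketch of what "automatically requested" could look like in a driver's completion path follows. Everything here is an assumption made for illustration: struct sketch_cmd, sketch_finish() and the request-sense hook are hypothetical, the fake hook stands in for real hardware, and the status value 0x02 is the CHECK CONDITION status byte as the SCSI standard defines it (kernel headers may encode it differently).

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_STATUS_GOOD            0x00
#define SKETCH_STATUS_CHECK_CONDITION 0x02  /* status byte as defined by SCSI */
#define SKETCH_SENSE_LEN              18

/* Hypothetical per-command record; not the real Scsi_Cmnd. */
struct sketch_cmd {
    unsigned char status;
    unsigned char sense[SKETCH_SENSE_LEN];
    bool          sense_valid;
};

/* Hypothetical hook that issues REQUEST SENSE to the target. */
typedef bool (*sketch_request_sense_fn)(unsigned char *buf, size_t len);

/* Fake "hardware" for the sketch: reports a MEDIUM ERROR (sense key 0x3). */
static bool sketch_fake_request_sense(unsigned char *buf, size_t len)
{
    if (len < 3)
        return false;
    buf[0] = 0x70;  /* fixed-format sense data, current error */
    buf[2] = 0x03;  /* sense key: MEDIUM ERROR */
    return true;
}

/* Completion path: fetch sense data before handing the command back. */
static void sketch_finish(struct sketch_cmd *cmd,
                          sketch_request_sense_fn request_sense)
{
    if (cmd->status == SKETCH_STATUS_CHECK_CONDITION && !cmd->sense_valid) {
        memset(cmd->sense, 0, sizeof(cmd->sense));
        cmd->sense_valid = request_sense(cmd->sense, sizeof(cmd->sense));
    }
    /* A real driver would now call the completion callback. */
}
```

The point of the design is that the sense data is already attached when the command is reported, so the error handler thread has no reason to wake up just to fetch it.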

Defining new entrypoints

There are actually two separate ways in which a driver author could make use of the new error handling code. They can be roughly called the "easy" and the "hard" way.

The easy way is designed to make it easy for people to convert from the old style error handling to the new style. The prescription is as follows:

Define up to 4 functions in the driver:

      int (*eh_abort_handler)(Scsi_Cmnd *);
      int (*eh_device_reset_handler)(Scsi_Cmnd *);
      int (*eh_bus_reset_handler)(Scsi_Cmnd *);
      int (*eh_host_reset_handler)(Scsi_Cmnd *);

These things perform the obvious jobs. A couple of points here.
There is *NO* timer running when these things are running. If your functions might block indefinitely for any reason, it is up to the driver author to do their own timeout handling and restore sanity before returning.
These functions must return either SUCCESS or FAILED. No other return values are allowed. SUCCESS implies that the low-level driver succeeded in performing the requested task.
You are not required to supply all of these interfaces. Any which you do not define are treated as if a function were present that returned FAILED.
These functions are all called from the context of the error handler thread. This is not in an interrupt handler - thus it is safe to sleep, if there is some reason you need to do so.
You should deal with SMP safety for your entrypoints. Note that currently the io_request_lock is held whenever a low-level driver entrypoint is called - this is something I wish to change in the near future. Please see my comments below in the future directions section related to SMP issues and locking. As things stand now, the following should hold:
You should assume that when your driver is called through the queuecommand, error handling and detect routines that io_request_lock is held.
You should grab io_request_lock in your interrupt service routine, and release it before returning.
You must use the irq variants of the spinlocks for your locking. If you do not, you will not block interrupts, and you could end up deadlocking the system as the interrupt handler might try and grab the lock while the function that was interrupted might already be holding it.
If you need to pause/delay for any purpose, the lock should be released and then re-asserted.
If you choose to use your own lock internal to your own driver, then you may use it instead and release io_request_lock within your own routines. You must grab it again before returning, however. Also, please note that io_request_lock must be held when you do a callback to the completion function in the mid-layer.
If you see deviations from this behavior (i.e. your driver is getting called without io_request_lock held), please report it as a bug.
Note that the new error handling interfaces are always called from the context of the error handler thread, and there is a single error handler thread per instance of a host adapter (in other words, if there are two identical buslogic cards, there will be two error handler threads running). Thus there are probably no SMP issues that should require additional locking. However, driver authors should keep in mind that if they have data structures shared by multiple instances of cards, they may need to protect the data structures in some way.
The default strategy routine is scsi_unjam_host() in scsi_error.c. There is a fairly simple flow of control from top to bottom, where it first tries simpler and safer things, and then moves on to try harder things. In the event that it is impossible to bring a device back to life, then the strategy routine has the option of marking a device offline. This means that at the high levels any attempt to use the device will fail. This is intended for those really pathological cases where the drive really goes nuts (or where media is removed from a removable drive). I vaguely recall rumors that this presents a minor problem in that it is hard to unmount a disk marked offline - I would like to hear from people who have this problem and we will see what we can do to improve the situation.
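
The "simpler things first, harder things later" flow, together with the rule that a missing handler counts as FAILED, can be modelled in a few lines of user-space C. This is only a sketch of the idea and makes no claim to mirror scsi_unjam_host(): the sketch_* types and functions are hypothetical, and SKETCH_SUCCESS/SKETCH_FAILED are stand-ins for the real SUCCESS and FAILED constants.

```c
#include <stddef.h>

/* Stand-in values; a real driver uses SUCCESS and FAILED from the SCSI headers. */
#define SKETCH_SUCCESS 1
#define SKETCH_FAILED  0

/* Hypothetical stand-in for Scsi_Cmnd. */
struct sketch_cmnd {
    int id;
    int device_offline;
};

typedef int (*sketch_eh_fn)(struct sketch_cmnd *);

/* Only the four handler slots of the host template are modelled. */
struct sketch_template {
    sketch_eh_fn eh_abort_handler;
    sketch_eh_fn eh_device_reset_handler;
    sketch_eh_fn eh_bus_reset_handler;
    sketch_eh_fn eh_host_reset_handler;
};

/* Example driver handlers: the abort cannot be done, the bus reset works. */
static int sketch_abort_fails(struct sketch_cmnd *cmd)
{
    (void)cmd;
    return SKETCH_FAILED;
}

static int sketch_bus_reset_ok(struct sketch_cmnd *cmd)
{
    (void)cmd;
    return SKETCH_SUCCESS;
}

/*
 * Try the handlers from gentlest to harshest. Returns the 1-based step that
 * succeeded, or 0 after marking the device offline when every step failed.
 */
static int sketch_recover(const struct sketch_template *t,
                          struct sketch_cmnd *cmd)
{
    const sketch_eh_fn steps[] = {
        t->eh_abort_handler,
        t->eh_device_reset_handler,
        t->eh_bus_reset_handler,
        t->eh_host_reset_handler,
    };
    size_t i;

    for (i = 0; i < sizeof(steps) / sizeof(steps[0]); i++) {
        /* A missing handler counts as one that returned FAILED. */
        int rc = steps[i] ? steps[i](cmd) : SKETCH_FAILED;
        if (rc == SKETCH_SUCCESS)
            return (int)i + 1;
    }
    cmd->device_offline = 1;
    return 0;
}
```

A driver that supplies only an abort handler and a bus reset handler would, in this model, skip straight from a failed abort to the bus reset, since the undefined device reset slot is treated as a failure.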

The hard way of doing this involves instead defining a single function:

      int (*eh_strategy_handler)(struct Scsi_Host *);
This is essentially just a hook which allows a low-level driver author to replace the entire contents of scsi_unjam_host() with a customized function. It is provided for driver authors who really want to roll their own. At this time, I am not aware of any driver author actually attempting this. We will probably need to adjust the list of symbols exported from the mid-level if you wish to make use of this. Anyone wishing to try this should contact me, as we may wish to rearrange the responsibilities a little bit.
The strategy function will be called from the context of the error handler thread once the host has reached a quiescent state (all commands either succeeded, failed or timed out). The mid-level still has the responsibility for blocking further commands and then waking up the error handler thread.
The return value is currently not significant, but I ask that you return 0 to allow for future enhancements.
Once the function returns, further commands will once again be allowed to be sent to the host, and the error handler thread will go back to sleep.
There is no timer running for this. Take as much time as you need for error recovery.
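
The division of labor described above might be pictured as follows. This is a hypothetical user-space model: struct sketch_shost, the run counters, and sketch_eh_thread_once() are invented for the illustration, and the real error handler thread does considerably more.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct Scsi_Host. */
struct sketch_shost {
    int (*eh_strategy_handler)(struct sketch_shost *);
    bool blocked;       /* set by the mid-level before the thread is woken */
    int  custom_runs;   /* how often the driver's own strategy ran */
    int  default_runs;  /* how often the default strategy ran */
};

/* Stand-in for the default strategy routine. */
static void sketch_default_unjam(struct sketch_shost *h)
{
    h->default_runs++;
}

/* Example driver-supplied strategy; returns 0 as the text requests. */
static int sketch_my_strategy(struct sketch_shost *h)
{
    h->custom_runs++;
    return 0;
}

/* One pass of the error handler thread over a host that is already quiescent. */
static void sketch_eh_thread_once(struct sketch_shost *h)
{
    if (h->eh_strategy_handler)
        (void)h->eh_strategy_handler(h);  /* return value is not significant */
    else
        sketch_default_unjam(h);

    h->blocked = false;  /* commands may be sent to the host again */
}
```

Note that blocking the host and unblocking it both stay with the caller in this model; the strategy function only performs the recovery itself.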

Fix queuecommand return value

In addition to defining the error handling functions, you will need to modify the queuecommand interface to return a meaningful value. A return value of 0 implies that the command was correctly queued. A non-zero return value implies that the host was unable to accept the command (probably because of some resource shortage) - in this case the command is left on the queue for the device. No further commands will be sent to this host until some command that is currently running on the host completes. Note that it is assumed that at least one command can be queued to a host at any given time.
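
As an illustration of the return convention, here is a small user-space model of an adapter with a fixed number of command slots. The names (struct sketch_adapter, sketch_queuecommand(), SKETCH_SLOTS) are hypothetical and the slot array is just one way a resource shortage might arise.

```c
#include <stddef.h>

#define SKETCH_SLOTS 2  /* hypothetical adapter with two command slots */

/* Hypothetical stand-in for a command. */
struct sketch_req {
    int tag;
};

struct sketch_adapter {
    struct sketch_req *slot[SKETCH_SLOTS];
};

/* Returns 0 if the command was queued, non-zero if the adapter is full. */
static int sketch_queuecommand(struct sketch_adapter *a, struct sketch_req *r)
{
    size_t i;

    for (i = 0; i < SKETCH_SLOTS; i++) {
        if (a->slot[i] == NULL) {
            a->slot[i] = r;
            return 0;
        }
    }
    return 1;  /* caller leaves the command on the device queue */
}

/* Completion frees the slot, after which queueing can succeed again. */
static void sketch_complete(struct sketch_adapter *a, struct sketch_req *r)
{
    size_t i;

    for (i = 0; i < SKETCH_SLOTS; i++) {
        if (a->slot[i] == r)
            a->slot[i] = NULL;
    }
}
```

In this model a third command is refused while both slots are occupied and accepted once either of the first two completes, which matches the rule that nothing further is sent until a running command finishes.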

Set the use_new_eh_code flag

Finally, you need to set the use_new_eh_code flag to 1 in the host template. This essentially tells the mid-layer that the driver is ready to use the new error handling code. This should be a trivial one-line change to your host template - probably just adding a line like:

		     use_new_eh_code:		1,	\
should be sufficient.

 

Other cleanups

As long as people are working on these things, I have a request to make. I would like people to delete the command() interface if there is also a queuecommand() interface. The command() interface will never be used if there is a queuecommand() interface. At the same time, check for any other dead functions (completion functions that would be used to wake up the thread in the command() interface).

Future Directions

When the SCSI layer was first made SMP safe, locking was added all over the place. The general idea was to hold io_request_lock as long as control was within the SCSI layer. While this made the thing SMP-safe, in the long run it wasn't a great idea, as it led to a lot of latency issues - the problem is that as long as you hold this lock, all other block devices are prevented from doing any work. There is also an architectural problem whereby the low-level drivers are assuming that the mid-level is holding the io_request_lock when the low-level interfaces are called, and this isn't optimal. The fundamental issue here is that a finer granularity of lock needs to be used so as to make the system more responsive.

The new queuing code helps with a lot of this - new locks were added to protect specific data structures, and this reduced the need for holding io_request_lock in the mid-upper levels itself. I haven't changed anything WRT locking and low-level drivers, however.

Typically there are two major entrypoints that are of significance: the queuecommand() interface for inserting new commands, and the interrupt handler which processes commands that are now done. As I mentioned before, the error handling routines are always run from the context of a single error handler thread. There is the possibility that an interrupt will arrive for a timed out command during error processing, and thus it is also important that you do the appropriate locking in each one of the error handling strategy routines (assuming that they actually do anything in the first place).

Some driver authors have taken it upon themselves to implement their own locking. Thus upon entering the queuecommand entrypoint, they immediately release the io_request_lock, and then grab the actual low-level lock that they implemented. This is mostly the right solution, but there is still some small overhead in locking/unlocking io_request_lock when we don't need to be.

My long range goal is to turn things around so that the low-level drivers are completely responsible for their own locking, and make no assumptions about what locks might be held upon entry. In addition, I would like it if none of the low-level drivers touched io_request_lock - each driver should have its own locks for this purpose.

It would be theoretically possible for the mid-layer to use finer grained locks when calling the low-level drivers, but I believe that this is not the correct approach. It is architecturally more correct to instead insist that each low-level driver be responsible for any locking to protect its own data structures.

Ideally I would also like to eliminate all references to io_request_lock from low-level drivers. To accomplish this, I am thinking of adding a flag to the host template "smp_safe". If you do not set this (default case), then io_request_lock is assumed to be held when the mid-level calls into a low-level driver. In addition, it will also be assumed that when the low-level driver calls back to the mid-level to report command completion, the low-level driver has taken hold of this lock.

If you set "smp_safe", then these assumptions are reversed. In particular, it is assumed that the low-level driver does all of its own locking and thus the mid-level does not grab io_request_lock prior to calling the low-level driver. The low-level driver really shouldn't use io_request_lock at all.
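
Since none of this is coded yet, the following is only a sketch of how the proposed flag might steer the call into the driver. The lock is a counting mock so the example can run in user space, and every name (struct sketch_tmpl, sketch_dispatch(), sketch_lock() and so on) is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Counting mock of io_request_lock so the sketch runs in user space. */
static int sketch_io_request_lock_depth;

static void sketch_lock(void)   { sketch_io_request_lock_depth++; }
static void sketch_unlock(void) { sketch_io_request_lock_depth--; }

/* Hypothetical slice of a host template carrying the proposed flag. */
struct sketch_tmpl {
    bool smp_safe;                   /* proposed flag; unset by default */
    int (*queuecommand)(void *cmd);
};

/* Records whether the mock lock was held when the driver was entered. */
static int sketch_lock_seen_on_entry = -1;

static int sketch_example_queuecommand(void *cmd)
{
    (void)cmd;
    sketch_lock_seen_on_entry = sketch_io_request_lock_depth;
    return 0;
}

/* Call into the driver, taking the global lock only for legacy drivers. */
static int sketch_dispatch(const struct sketch_tmpl *t, void *cmd)
{
    int rc;

    if (!t->smp_safe)
        sketch_lock();
    rc = t->queuecommand(cmd);
    if (!t->smp_safe)
        sketch_unlock();
    return rc;
}
```

A driver that leaves the flag unset sees the lock held on entry exactly as it does today, while a driver that sets it is entered with no lock held and has to bring its own.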

A low-level driver can protect its data structures in a couple of ways. The two main choices are to use a driver specific spinlock, or to use a host specific spinlock. The difference is only apparent and important if there are multiple instances of identical cards on a system. In the case of a driver specific spinlock, there is one spinlock shared by all hosts that are using the driver. The case of a host-specific spinlock would imply that there is one spinlock per instance of host. In the event that you choose to use host specific spinlocks, beware of accessing any data structures shared between hosts.

Finally, if you set "smp_safe", there is no lock held for the duration of error recovery. Thus it may be the case that the eh_abort_handler() is called for a command that may have already completed. The error handling routines need to allow for the possibility that the command is no longer running on the host, and in such an event just return FAILED. It is important that the interrupt service routine and the error handler functions each grab the same lock prior to doing work in order to eliminate race conditions.
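
The race described here can be sketched in user space with a per-host lock that both the interrupt path and the abort handler take. This is an illustration under assumptions rather than driver code: the C11 atomic_flag spin loop stands in for a kernel spinlock (a real driver would use the irq variants), and struct sketch_hba and the sketch_* functions are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Stand-in values; a real driver uses SUCCESS and FAILED from the SCSI headers. */
#define SKETCH_SUCCESS    1
#define SKETCH_FAILED     0
#define SKETCH_MAX_ACTIVE 4

/* Hypothetical per-host state guarded by one host-specific lock. */
struct sketch_hba {
    atomic_flag lock;
    const void *active[SKETCH_MAX_ACTIVE];  /* commands running on the host */
};

static void sketch_hba_init(struct sketch_hba *h)
{
    size_t i;

    atomic_flag_clear(&h->lock);
    for (i = 0; i < SKETCH_MAX_ACTIVE; i++)
        h->active[i] = NULL;
}

static void sketch_hba_lock(struct sketch_hba *h)
{
    while (atomic_flag_test_and_set(&h->lock))
        ;  /* spin until the lock is free */
}

static void sketch_hba_unlock(struct sketch_hba *h)
{
    atomic_flag_clear(&h->lock);
}

/* Caller holds the lock. Removes cmd; returns 1 if it was still active. */
static int sketch_take_locked(struct sketch_hba *h, const void *cmd)
{
    size_t i;

    for (i = 0; i < SKETCH_MAX_ACTIVE; i++) {
        if (h->active[i] == cmd) {
            h->active[i] = NULL;
            return 1;
        }
    }
    return 0;
}

/* Start a command: returns 0 if a slot was found, 1 if the host is full. */
static int sketch_start(struct sketch_hba *h, const void *cmd)
{
    size_t i;
    int rc = 1;

    sketch_hba_lock(h);
    for (i = 0; i < SKETCH_MAX_ACTIVE; i++) {
        if (h->active[i] == NULL) {
            h->active[i] = cmd;
            rc = 0;
            break;
        }
    }
    sketch_hba_unlock(h);
    return rc;
}

/* Interrupt path: the command finished on its own. */
static void sketch_isr_complete(struct sketch_hba *h, const void *cmd)
{
    sketch_hba_lock(h);
    (void)sketch_take_locked(h, cmd);
    sketch_hba_unlock(h);
}

/* Abort handler: FAILED if the command is no longer running on the host. */
static int sketch_eh_abort(struct sketch_hba *h, const void *cmd)
{
    int rc;

    sketch_hba_lock(h);
    rc = sketch_take_locked(h, cmd) ? SKETCH_SUCCESS : SKETCH_FAILED;
    sketch_hba_unlock(h);
    return rc;
}
```

Because both paths look up and remove the command under the same lock, exactly one of them wins: an abort that arrives after the interrupt has retired the command finds nothing to abort and returns the failure value.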

At this point, I haven't coded the actual changes. In terms of the mid-level, I would need to define the flag in the host template, and then fix it so that the io_request_lock isn't grabbed when calling the queuecommand function. In addition, the new error handling code will need to be examined to make sure that it is SMP-safe without holding the lock, and then fix it so that this lock isn't held during error processing. Note that I have no intention of trying to support smp_safe for drivers that use the old error handling code. I don't know when I will get around to trying to do this - it will be after the new queuing code is in and has been shaken down for a few months.

Other known problems

There are a couple of other known problems with SCSI - sometimes it is easy to forget about them as the sands of time pass. I have enumerated these here: scsi_todo.html. This is in addition to the stuff I discussed in future directions.

If you have comments or suggestions,

Last updated: 1/3/00.