Linux SCSI error handling

  1. Introduction
  2. New features
  3. Description of logging facility
  4. Description of new host template format
  5. Description of changes to low-level drivers required for new EH code
  6. Download patches

 

Introduction

A rewrite of the error handling code is currently underway. To see why a rewrite is a good idea, it helps to understand the main shortcomings of the existing code.

The existing error handling code is an imperfect state machine. It would occasionally get stuck in a loop in which the bus was reset over and over again, the problem was never resolved, and control of the machine never returned to the user.
Initializing the timer used for timeouts could be exceedingly slow on machines with large numbers of scsi devices.
Whenever a command was completed by a low-level driver, the results were passed all the way up to the top level scsi code, which in turn could queue a new command as a replacement. This would all take place with interrupts turned off, which could lead to severe latency problems.
Error handling was always performed in the context of an interrupt handler. This limited our ability to correct problems in that we were required to perform actions that would lead to another interrupt at a later time at which point we could decide how to update the state machine.
Error handling was always performed while other commands were being sent to the host. A good analogy would be trying to change the fan belt on a car while the engine is running.
The error/condition logging facilities were extremely primitive. Typically the user had to recompile the kernel with printk() statements added or uncommented, run the kernel to collect the information, and then report the results back. This isn't fun.

The fact that many of these shortcomings exist has been known for a considerable time. The problem has always been that until you have a better idea for how to do something, you cannot write a replacement.

It turns out that pieces of ideas had been floating about amongst a number of Linux developers, and at the Linux Expo in Raleigh NC in April a number of us came together and were able to put together a rough sketch of what a reasonable replacement would be. Later email discussions with other Linux developers resulted in enhancements and refinements to these ideas. The patches on this site are a work in progress that implements the ideas we came up with.

There are other areas of the scsi system which need attention. At the moment, the error handling is the highest priority, since this is where things tended to get screwed up the most often. The additional areas that will undergo work in the near future are:

The dev_t is currently 16 bits, which limits us to 16 disk drives. While this may sound like a lot to people sitting at home with a Linux box, large servers tend to have many hundreds of disks. The actual solution to the problem is mostly done - there is support for a generic kdev_t in the kernel which doesn't have to be the same as the external dev_t. Thus we could internally convert to a 64 bit dev_t, and then add some new system calls that would allow mknod and stat to correctly read the results. Most of the filesystems are supposed to be storing a full 64 bits for dev_t now.
The mapping of devices to minor numbers turns out to be problematic when you have very large numbers of devices. The way things currently are, when one device fails to spin up the numbering of all of the others can be affected, and thus the system would be trying to mount the wrong devices on the wrong mount points. One suggestion was to switch to a sparse mapping of devices to minor numbers (once we get a larger dev_t), which would eliminate this problem. I am wondering whether it would make sense to leapfrog this, and instead give the kernel some knowledge of volume IDs, which would allow a user application to query for the device on which a specific volume is attached. This has the strong advantage that it would be more intuitive, but unlabeled partitions could still be somewhat troublesome.
The upper scsi layers have a lot of logic in them which probably should be at a higher level. There is work underway by others to rewrite the layer at ll_rw_blk.c so as to make it possible to simplify the logic in the upper scsi layers.
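The 16-disk limit in the first item above follows from simple arithmetic. The sketch below works it through under two assumptions that the text does not spell out: that the 16-bit dev_t is split into an 8-bit major and an 8-bit minor number, and that the disk driver reserves 16 minors per disk (the whole disk plus 15 partitions). The function and macro names are made up for this illustration.

```c
/* Assumed layout, for illustration only: 8 bits of minor number under one
 * major, and 16 minors reserved per disk (whole disk + 15 partitions). */
#define MINOR_BITS       8u
#define MINORS_PER_DISK  16u

/* Number of disks that fit under a single major number. */
static unsigned int max_disks(unsigned int minor_bits,
                              unsigned int minors_per_disk)
{
    return (1u << minor_bits) / minors_per_disk;
}

/* Minor number for a given disk index and partition (0 = whole disk).
 * Because the disk index is simply the order in which disks were found,
 * a disk that fails to spin up shifts the minors of every later disk. */
static unsigned int disk_minor(unsigned int disk, unsigned int partition)
{
    return disk * MINORS_PER_DISK + partition;
}
```

With these assumptions, max_disks(MINOR_BITS, MINORS_PER_DISK) gives 16, and the last usable minor, disk_minor(15, 15), is 255.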

There are lots of ideas floating around about exactly what sorts of changes make sense - as we get closer to actually implementing them the details will be filled in here.

New Features

Despite the fact that we are only currently working on the error handling code, the patches are rather large. The reason for this is that there are a fairly large number of things which we are trying to accomplish at the same time. Here is a summary of some of the new features. A few will be discussed in more detail below.

Fast timers are now used for all scsi commands. This idea came from Ingo. This should be a significant performance enhancement.
A bottom half handler is now used for completion processing. When a low-level driver finishes a command, the mid-level function which receives it merely puts it in a queue and requests that the bottom half handler finish processing. There are many advantages - the main ones are that re-entrancy of the low-level driver is drastically reduced and that interrupt latency is reduced considerably.
Error handling is now done in a separate kernel thread. The kernel thread is never used unless an error has taken place which requires some sort of attention (e.g. an abort or a reset). Note - low-level drivers that don't automatically request sense information when a CHECK_CONDITION is received will be bumped into the error handling thread, where the sense information will be requested.
When the error handling thread is woken, all further queuing of commands to the host in question is stopped. The error handling thread does not begin work until all commands have either succeeded, failed or timed out. This will ensure that the error handling thread is running in an environment where everything is quiet and stable.
The error handling thread is more or less just like another process. This means that it is legal to sleep, and this means that error handling doesn't need to be done with a giant state machine. A much simpler linear flow of control will accomplish the same task. In addition, it is now possible for the error handling thread to take a device offline if all attempts to correct the problem fail, and this means that all outstanding commands to the device will fail. There will be no processes hung waiting for an I/O operation to complete.
While the eventual goal is to convert all of the drivers over to use the new error handling code, it may take some time for some of the maintainers to get caught up. It is possible to select with a simple boolean whether a given low-level driver uses the old or the new error handling code.
A logging facility has been added to the kernel to facilitate troubleshooting. This can be selectively enabled by means of simple shell commands, and it can also be enabled from the LILO kernel command prompt.
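
The bottom half arrangement in the list above can be pictured with a small user-space sketch. It is not the kernel code: the struct and function names are hypothetical, and the locking a real interrupt handler would need around the queue is left out. The point is only that the interrupt-time path links the command into a queue and returns, while the real work happens later.

```c
#include <stddef.h>

/* Hypothetical stand-in for a SCSI command; not the kernel's Scsi_Cmnd. */
struct cmd {
    int id;
    int done;            /* set once completion processing has run */
    struct cmd *next;
};

/* FIFO of commands whose completion processing has been deferred. */
struct done_queue {
    struct cmd *head;
    struct cmd *tail;
};

/* "Interrupt" path: only link the command in and return immediately. */
static void defer_completion(struct done_queue *q, struct cmd *c)
{
    c->next = NULL;
    if (q->tail)
        q->tail->next = c;
    else
        q->head = c;
    q->tail = c;
}

/* "Bottom half": drain the queue in arrival order and finish each command.
 * Returns the number of commands processed. */
static int run_bottom_half(struct done_queue *q)
{
    int n = 0;

    while (q->head) {
        struct cmd *c = q->head;

        q->head = c->next;
        if (!q->head)
            q->tail = NULL;
        c->next = NULL;
        c->done = 1;     /* placeholder for the real completion work */
        n++;
    }
    return n;
}
```

In the real kernel the enqueue side would have to be safe against concurrent interrupts; this sketch ignores that in order to keep the control flow visible.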

Description of new logging facility

The logging facility will make it much easier to troubleshoot problems in the field. The basic idea is that debugging output, which is placed into the kernel log file, can be enabled at any given instant in time. There are a number of different facilities within the kernel, each of which can generate messages at varying levels of verbosity. When you have specific problems in one area, you can turn on highly verbose logging for the messages in that subsystem, and leave the rest off, or put some of the others at lower levels of verbosity.

Each facility has a name. The valid names are: all, none, timeout, scan, mlqueue, mlcomplete, llqueue, llcomplete, hlqueue, hlcomplete, and ioctl. Each facility can have a verbosity level from 0 to 7, where 0 indicates completely quiet, and 7 represents maximum verbosity.

There are two fundamental ways of turning this on - one is via the LILO command line:

scsi_logging=N

where N is a number that represents the logging mask. There isn't a really good symbolic way of describing what you are really turning on from the kernel command line - you are essentially specifying a 32 bit quantity that has all of the logging levels stored in it. The easiest way to find the number you actually want is by setting the desired logging level from the /proc filesystem, and then recording the logging mask that is reported.

The other way to turn on logging is via the /proc filesystem. A shell command like:

# echo "scsi log timeout 5" > /proc/scsi/scsi

will set the verbosity level to 5 for events related to timeout handling. If you are lazy and unsure where the problem is, you can set all of the facilities to the maximum verbosity level with the following command:

# echo "scsi log all" > /proc/scsi/scsi

Similarly, it is possible to turn off all logging with the following command:

# echo "scsi log none" > /proc/scsi/scsi

Note - there is a kernel configuration option that can be used to enable/disable logging. The presence of the logging code shouldn't present any noticeable run-time overhead, as the testing to see whether a given facility is at a sufficiently high logging level is all done inline without any function calls. The only noticeable effect of having the logging enabled is that the kernel will be slightly larger, as a number of printk() and other statements will need to be present.
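
To make the logging mask less abstract, here is a minimal sketch of how several 0 to 7 levels can be packed into one 32 bit quantity and tested inline. The facility order, the bit positions and all of the names below are assumptions made for this illustration; they are not taken from the patches, so the number to pass to scsi_logging= should still be read back from /proc as described above.

```c
#include <stdint.h>

/* Hypothetical facility numbering, for illustration only. */
enum log_facility {
    LOG_TIMEOUT, LOG_SCAN, LOG_MLQUEUE, LOG_MLCOMPLETE, LOG_LLQUEUE,
    LOG_LLCOMPLETE, LOG_HLQUEUE, LOG_HLCOMPLETE, LOG_IOCTL
};

#define LOG_BITS        3u                      /* levels 0..7 fit in 3 bits */
#define LOG_FIELD_MASK  ((1u << LOG_BITS) - 1u)

/* Return a copy of mask with facility f set to the given level. */
static uint32_t log_set(uint32_t mask, enum log_facility f, unsigned int level)
{
    unsigned int shift = (unsigned int)f * LOG_BITS;

    mask &= ~((uint32_t)LOG_FIELD_MASK << shift);
    return mask | ((uint32_t)(level & LOG_FIELD_MASK) << shift);
}

/* Extract the level currently stored for facility f. */
static inline unsigned int log_get(uint32_t mask, enum log_facility f)
{
    return (mask >> ((unsigned int)f * LOG_BITS)) & LOG_FIELD_MASK;
}

/* Inline test: a shift, a mask and a compare, with no function call
 * once log_get() has been inlined. */
#define LOG_ENABLED(mask, f, level)  (log_get((mask), (f)) >= (level))
```

Nine facilities at 3 bits each use 27 bits, so they fit comfortably in a 32 bit mask. A message at level 3 for the timeout facility would be guarded by something like `if (LOG_ENABLED(mask, LOG_TIMEOUT, 3))`.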

Description of new format of host templates

All of the host templates have been reformatted to make it easier to add new functions that might only be used by a few hosts. The fundamental point is that GCC allows structures to be initialized with the following syntax:

struct foo { int mmm; int nnn; };

struct foo mystruct = {mmm: 3};

Syntactically you are able to individually specify the initializer for each element by name. Any elements that aren't specified are initialized to 0. The advantage of this type of initializer is that it would be possible to add a new function pointer to the structure definition for the special purpose use of one particular driver, but none of the other drivers would need source modifications to fill in a new 'NULL' placeholder in the initializer.
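
As a small illustration of that advantage, the sketch below uses a made-up miniature template (the struct and its fields are hypothetical, not the real host template). It is written with the C99 spelling `.field = value`, which has the same meaning as the `field: value` form shown above; GCC accepts both but documents the colon form as obsolete.

```c
#include <stddef.h>

/* Hypothetical miniature of a host template, for illustration only. */
struct mini_template {
    const char *name;
    int (*detect)(void);
    int (*special_hook)(void);   /* added later for one driver only */
    int this_id;
};

static int demo_detect(void)
{
    return 1;
}

/* This initializer was written before special_hook existed and needs no
 * change: the unnamed member is implicitly set to a null pointer. */
static struct mini_template demo_template = {
    .name    = "demo",
    .detect  = demo_detect,
    .this_id = 7,
};
```

Code that uses the template can then test `demo_template.special_hook != NULL` before calling the hook, so only the one driver that needs it has to mention it.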

Driver modifications to use new error handling code

There are a number of relatively minor changes that will be required to take advantage of the new error handling code. The first and most important is that there is a field in the host template which is a boolean value that indicates whether the driver will be using the old or the new error handling code. To illustrate this, I will give an example of one host template that contains all of the required changes:

#define AHA1542 { proc_dir: &proc_scsi_aha1542, \
    name: "Adaptec 1542", \
    detect: aha1542_detect, \
    command: aha1542_command, \
    queuecommand: aha1542_queuecommand, \
    abort: aha1542_old_abort, \
    reset: aha1542_old_reset, \
    eh_abort_handler: aha1542_abort, \
    eh_device_reset_handler: aha1542_dev_reset, \
    eh_bus_reset_handler: aha1542_bus_reset, \
    eh_host_reset_handler: aha1542_host_reset, \
    bios_param: aha1542_biosparam, \
    can_queue: AHA1542_MAILBOXES, \
    this_id: 7, \
    sg_tablesize: AHA1542_SCATTER, \
    cmd_per_lun: AHA1542_CMDLUN, \
    unchecked_isa_dma: 1, \
    use_clustering: ENABLE_CLUSTERING, \
    use_new_eh_code: 1}

#endif

The boolean that controls whether the old or new error handling code is used is the use_new_eh_code field.

There are several other new fields. These are the eh_abort_handler, eh_device_reset_handler, eh_bus_reset_handler, and eh_host_reset_handler fields. Each of these is used by the mid-level code to request a specific action. Each of these takes a single argument, the SCpnt for the command in question (from which you can obtain the host structure, if need be), and each one of these returns either SUCCESS or FAILED. No other return codes are allowed. In addition the mid-level code will not be running a timer to ensure that the function returns within a specific period of time. If the operation could potentially get stuck, it is the responsibility of the low-level routine to add a timer to unblock and clean up. You are not required to add a timer, of course - only if it is possible for the operation to get stuck and never return.

You are not required to implement all of the eh_*_handler interfaces. Any that are unspecified are treated as if they returned FAILED, and the error handler thread will go on to the next level. In case it isn't clear from the above discussion, here are the prototypes of the new function pointers:

int (*eh_abort_handler)(Scsi_Cmnd *);
int (*eh_device_reset_handler)(Scsi_Cmnd *);
int (*eh_bus_reset_handler)(Scsi_Cmnd *);
int (*eh_host_reset_handler)(Scsi_Cmnd *);
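
To show how these pointers fit together, here is a rough sketch of an escalation loop. It is not the code from the patches: the struct names, the numeric values of SUCCESS and FAILED, and the helper functions are placeholders, and the order of the levels (abort, device reset, bus reset, host reset) is assumed from the order in which the handlers are listed above.

```c
#include <stddef.h>

/* Placeholder values for this sketch; the real constants live in the
 * kernel headers. */
#define SUCCESS 1
#define FAILED  0

struct cmnd;   /* hypothetical stand-in for Scsi_Cmnd */

struct eh_template {
    int (*eh_abort_handler)(struct cmnd *);
    int (*eh_device_reset_handler)(struct cmnd *);
    int (*eh_bus_reset_handler)(struct cmnd *);
    int (*eh_host_reset_handler)(struct cmnd *);
};

/* Two trivial handlers so the sketch can be exercised. */
static int always_fail(struct cmnd *c) { (void)c; return FAILED; }
static int always_ok(struct cmnd *c)   { (void)c; return SUCCESS; }

/* Try each level in turn. A missing handler counts as FAILED, so the loop
 * simply moves on. Returns the index of the level that succeeded (0 = abort,
 * 3 = host reset), or -1 if every level failed. */
static int eh_escalate(const struct eh_template *t, struct cmnd *c)
{
    int (*const levels[])(struct cmnd *) = {
        t->eh_abort_handler,
        t->eh_device_reset_handler,
        t->eh_bus_reset_handler,
        t->eh_host_reset_handler,
    };
    size_t i;

    for (i = 0; i < sizeof levels / sizeof levels[0]; i++) {
        if (levels[i] != NULL && levels[i](c) == SUCCESS)
            return (int)i;
    }
    return -1;
}
```

A return of -1 corresponds to the case described earlier where all attempts to correct the problem fail and the device is taken offline.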

Finally, the queuecommand interface must correctly return either TRUE or FALSE. TRUE indicates that the command was successfully queued to the host adapter; FALSE indicates that for some reason the command couldn't be queued, and that the mid-level should try again later. Note - the current patches don't yet support this in the mid-layer, but this is the eventual goal. The main reason for this change is to improve compatibility with BSD-derived drivers.

Note - this is a work in progress. There will be bugs, but I have tried a number of different host adapters myself in order to try and catch as many as I can.

Note2 - The bulk of the patches were added to the 2.1.75 kernel. Currently only the 1542 is using the new error handling code - all other drivers are continuing to use the old error handling code until we gain a bit more confidence in the new code.

Download kernel patches (2.1.41)

Download kernel patches (2.1.59)

Download kernel patches (2.1.71)

Patches already applied in 2.1.75 kernel.

If you have comments or suggestions, you can email me by clicking here

Last updated: 1/7/98.