SCSI queue handling

 

This site contains Linux kernel patches for SCSI queue handling. These changes became part of the distribution kernel as of 2.3.32.

  1. Introduction
  2. New Features
  3. Current status

Introduction

At the moment a rewrite of the scsi queue handling is underway. To understand the reasons that a rewrite is a good idea, it is important to understand many of the shortcomings of the existing code.

The existing code is very old, hard to maintain, and hard to understand. There are problems that nobody has dared to try and fix, mainly because they didn't understand the existing code well enough.
At the time of the design, the fastest card was an Adaptec 1542, and there was no such thing as a PCI bus. As you might imagine, there are some performance bottlenecks that become apparent with the newest hardware that would not have been evident back in the old days.
The existing code uses a single queue for handling all requests for disks, for example. This is a huge performance loss when large numbers of devices are active - the queue management has to sift through requests for all of the different devices and order them (as if ordering requests across different devices made any sense).
The existing code tried to handle every possible queue handling requirement in one place. By this I mean dealing with issues related to unchecked ISA DMA, scatter-gather, and possible clustering. This is wasted effort when the device doesn't require the extra checking - modern scsi host adapters are on the PCI bus, so the ISA DMA restrictions do not apply. For many host adapters there is no maximum size for the scatter-gather table, and thus clustering computations are wasted CPU cycles.
The existing code doesn't lend itself well to SMP.
The existing code allows requests to grow to the point where they are too large to be handled all at once, and as a result the queuing code must split requests. Really what should be happening is that requests should be prevented from growing this large to begin with.

The fact that many of these shortcomings exist has been known for a considerable time. The problem has always been that until you have a better idea for how to do something, you cannot write a replacement.

New Features

You need to essentially remember that the queue handling is a complete top to bottom rewrite. There are some elements in common, and obviously some of the tasks that must be performed still must be performed. The key is that the whole work flow has been completely reorganized.

One of the main complaints about the existing code was that it was too hierarchical. By this I mean that queue handling is entirely handled by the disk driver, or the cdrom driver. There are hints in the host template which say how many commands the host is capable of handling, whether it supports scatter-gather, etc, but what really needed to happen was for the workflow to be inverted. Technically the low-level driver should get control over queue management, and call into the disk driver or the cdrom driver to perform specific tasks as required. It isn't realistic to modify 40-odd low-level drivers to hook in new queuing code, but what is possible is to create a thin layer which does queue management and sits on top of the low-level driver.

By placing control at this level, it would be quite easy to allow the low-level driver to actually handle the queue if it wished but there isn't the need to modify and possibly break gobs and gobs of drivers. I have not actually put the hooks in place to allow low-level drivers to do this, but it wouldn't be difficult.

Another serious shortcoming was that there was a single monolithic queue for all disks. This has been corrected - there is now one queue per physical device. I am not sure if it really makes any sense to have any finer granularity than this, as it is ultimately the physical device which must respond to requests.

With the new code, the function scsi_request_fn is the queue handler for all scsi devices. The queue itself contains function pointers that are used for queue management (merging requests, etc), so while a single function is used, the actual implementation is tailored to the conditions at hand. The same goes for the preparation of the command and dispatching - function pointers are attached to the Scsi_Device object to handle these tasks. The scsi_request_fn itself is called because a pointer to it is in the queue object - it would be theoretically possible for low-level drivers to define their own function and do their own queue management, but a little work would need to be done to ensure that all of the utility functions which are used in the implementation that I have provided are exported via EXPORT_SYMBOL() such that modules would work.

In order to provide the hooks for queue management, I have defined a request_queue_t data structure - previously we just used a pointer to the queue head as the "queue" object. This change unfortunately forced me to make relatively trivial modifications to nearly every block device driver. There isn't much risk of breakage here - it is pretty mechanical cut-and-paste editing. The other fundamental change is that the request queue function now accepts a parameter, which is a pointer to the actual queue on which we need to start a command. This allows a single function to support more than one queue without the sort of hacks that are evident in the IDE disk driver (there are 7 stub functions, which turn around and call the actual queue handling function with an argument). I haven't removed this redundant code in the IDE driver - I would prefer that someone closer to the IDE driver handle this task.
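To make the effect of the queue parameter concrete, here is a minimal user-space sketch. It is not the kernel code: the structure layouts and every demo_ name are hypothetical, simplified stand-ins, and only the idea (one handler, stored as a pointer in the queue object and passed the queue it should service) comes from the description above.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the kernel's definitions. */
struct demo_request {
	int sector;
	struct demo_request *next;
};

struct demo_queue;
typedef void demo_request_fn(struct demo_queue *q);

struct demo_queue {
	struct demo_request *head;	/* pending requests for one device */
	demo_request_fn *request_fn;	/* handler called to start work */
	int dispatched;			/* demo counter, for illustration only */
};

/* One handler for every queue: q tells it which device to service. */
static void demo_handle_queue(struct demo_queue *q)
{
	while (q->head != NULL) {
		q->head = q->head->next;
		q->dispatched++;
	}
}

static void demo_init_queue(struct demo_queue *q, demo_request_fn *fn)
{
	q->head = NULL;
	q->request_fn = fn;
	q->dispatched = 0;
}

static void demo_add_request(struct demo_queue *q, struct demo_request *rq)
{
	rq->next = q->head;
	q->head = rq;
}

/* Two devices, two queues, one handler and no per-queue stub functions. */
static int demo_two_queues(void)
{
	struct demo_queue disk0, disk1;
	struct demo_request a = { 10, NULL }, b = { 20, NULL }, c = { 30, NULL };

	demo_init_queue(&disk0, demo_handle_queue);
	demo_init_queue(&disk1, demo_handle_queue);
	demo_add_request(&disk0, &a);
	demo_add_request(&disk0, &b);
	demo_add_request(&disk1, &c);

	disk0.request_fn(&disk0);
	disk1.request_fn(&disk1);
	return disk0.dispatched * 10 + disk1.dispatched;
}
```

Without the parameter, each queue would need its own stub that knows which queue it belongs to, which is exactly the pattern described for the IDE driver.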

The request_queue_t datatype includes a couple of function pointers for queue management. These are primarily used to decide if a new block can be added to an existing request, and to determine whether two requests can be merged. If these functions are not defined, then the existing queue management functions in ll_rw_blk.c are used instead. For some scsi drivers, it makes no sense to provide special functions - in particular for cases where there is no limit to the scatter-gather table size, and there are no DMA issues, there isn't much sense in redefining any of the worker functions.
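The fallback behaviour can be sketched roughly as follows. This is an illustration only: the demo_ types and functions are hypothetical, the "default" rule here (merge when the sectors are contiguous) is an assumed stand-in for whatever ll_rw_blk.c actually does, and the size limit is an invented example of a driver-specific restriction.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the kernel's definitions. */
struct demo_request {
	unsigned long sector;
	unsigned long nr_sectors;
};

struct demo_queue;
typedef int demo_merge_fn(struct demo_queue *q,
			  struct demo_request *req,
			  struct demo_request *next);

struct demo_queue {
	demo_merge_fn *merge_requests;	/* NULL means "use the default" */
	unsigned long max_sectors;	/* only consulted by the custom hook */
};

/* Assumed default policy: merge whenever the requests are contiguous. */
static int demo_default_merge(struct demo_queue *q,
			      struct demo_request *req,
			      struct demo_request *next)
{
	(void)q;
	return req->sector + req->nr_sectors == next->sector;
}

/* A driver with a size limit supplies its own, stricter hook. */
static int demo_limited_merge(struct demo_queue *q,
			      struct demo_request *req,
			      struct demo_request *next)
{
	if (!demo_default_merge(q, req, next))
		return 0;
	return req->nr_sectors + next->nr_sectors <= q->max_sectors;
}

/* Generic code: use the queue's hook if present, else the default. */
static int demo_try_merge(struct demo_queue *q,
			  struct demo_request *req,
			  struct demo_request *next)
{
	demo_merge_fn *fn = q->merge_requests ? q->merge_requests
					      : demo_default_merge;

	if (!fn(q, req, next))
		return 0;
	req->nr_sectors += next->nr_sectors;
	return 1;
}

static int demo_merge_hooks(void)
{
	struct demo_queue plain = { NULL, 0 };
	struct demo_queue limited = { demo_limited_merge, 12 };
	struct demo_request a = { 0, 8 }, b = { 8, 8 };
	struct demo_request c = { 0, 8 }, d = { 8, 8 };
	int merged_plain = demo_try_merge(&plain, &a, &b);	/* merges */
	int merged_limited = demo_try_merge(&limited, &c, &d);	/* 16 > 12 */

	return merged_plain * 10 + merged_limited;
}
```

A queue that leaves the hook unset pays nothing for restrictions it does not have, which is the point made above about adapters with no scatter-gather limit and no DMA issues.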

Given a single queue per physical device, it becomes much easier to support non-block device requests. Things like ioctl or character device requests can be converted into special types of requests (RQ_SPECIAL) which live in the queue along with the regular block device requests. The queue handling functions know that they are special and obviously will not merge these in any way.

The existing middle level code used to maintain its own queue of commands which were rejected by the host adapter - usually either because the host adapter is unable to handle the additional request, or because the device itself cannot accept an additional command. Maintaining a separate queue required special code to check for and handle these commands ahead of anything that might be arriving from the normal block device route. With a single queue per device, this special queue can be removed, and commands that have been rejected can instead be pushed onto the front of the queue with the RQ_SPECIAL tag (note that ioctl requests are appended to the end of the queue instead).
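The ordering rules for special requests can be sketched in a few lines of user-space C. Everything here is hypothetical (the demo_ names, the list layout, the DEMO_RQ_SPECIAL tag standing in for RQ_SPECIAL); it only illustrates the policy described above: rejected commands go to the front, ioctl requests go to the end, and special requests are never merged.

```c
#include <stddef.h>

enum demo_rq_kind { DEMO_RQ_BLOCK, DEMO_RQ_SPECIAL };

struct demo_request {
	enum demo_rq_kind kind;
	int id;
	struct demo_request *next;
};

struct demo_queue {
	struct demo_request *head, *tail;
};

static void demo_push_front(struct demo_queue *q, struct demo_request *rq)
{
	rq->next = q->head;
	q->head = rq;
	if (q->tail == NULL)
		q->tail = rq;
}

static void demo_push_back(struct demo_queue *q, struct demo_request *rq)
{
	rq->next = NULL;
	if (q->tail != NULL)
		q->tail->next = rq;
	else
		q->head = rq;
	q->tail = rq;
}

/* A command the host or device rejected is retried before anything else. */
static void demo_requeue_rejected(struct demo_queue *q, struct demo_request *rq)
{
	rq->kind = DEMO_RQ_SPECIAL;
	demo_push_front(q, rq);
}

/* An ioctl waits its turn behind the requests already queued. */
static void demo_queue_ioctl(struct demo_queue *q, struct demo_request *rq)
{
	rq->kind = DEMO_RQ_SPECIAL;
	demo_push_back(q, rq);
}

/* Special requests are never candidates for merging. */
static int demo_can_merge(const struct demo_request *a,
			  const struct demo_request *b)
{
	return a->kind == DEMO_RQ_BLOCK && b->kind == DEMO_RQ_BLOCK;
}

static struct demo_request *demo_pop(struct demo_queue *q)
{
	struct demo_request *rq = q->head;

	if (rq != NULL) {
		q->head = rq->next;
		if (q->head == NULL)
			q->tail = NULL;
		rq->next = NULL;
	}
	return rq;
}

/* Encodes the dispatch order of the request ids as a decimal number. */
static int demo_order(void)
{
	struct demo_queue q = { NULL, NULL };
	struct demo_request r1 = { DEMO_RQ_BLOCK, 1, NULL };
	struct demo_request r2 = { DEMO_RQ_BLOCK, 2, NULL };
	struct demo_request r3 = { DEMO_RQ_BLOCK, 3, NULL };
	struct demo_request r4 = { DEMO_RQ_BLOCK, 4, NULL };
	struct demo_request *rq;
	int order = 0;

	demo_push_back(&q, &r1);	/* ordinary block requests */
	demo_push_back(&q, &r2);
	demo_queue_ioctl(&q, &r3);	/* appended to the end */
	demo_requeue_rejected(&q, &r4);	/* jumps to the front */

	if (demo_can_merge(&r1, &r3))	/* never true: r3 is special */
		return -1;
	while ((rq = demo_pop(&q)) != NULL)
		order = order * 10 + rq->id;
	return order;
}
```

In this sketch the rejected command (id 4) is dispatched first, then the two block requests, then the ioctl, with no second queue to check along the way.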

One of the final challenges was to provide the actual queue handling functions. There would be serious maintainability problems if we attempted to keep 8 different functions (i.e. if you needed to modify one, then you would need to see whether the other 7 would need the same patch). This turned out to be relatively easy to solve - we define a single function which is perfectly general and can handle all of the different cases. As an example, let us consider a function that decides whether two requests can be merged. The answer depends upon whether the host requested clustering, and whether the host is using unchecked ISA DMA. The function definition would look something like:

__inline static int __scsi_merge_requests_fn(request_queue_t * q,
					     struct request * req,
					     struct request * next,
					     int use_clustering,
					     int dma_host)
{
	/*
	 * Appropriate contents
	 */
}

Then we define a preprocessor macro:

#define MERGEREQFCT(_FUNCTION, _CLUSTER, _DMA)		\
static int _FUNCTION(request_queue_t * q,		\
		     struct request * req,		\
		     struct request * next)		\
{							\
    return  __scsi_merge_requests_fn(q, req, next, _CLUSTER, _DMA); \
}

Finally, we use the macro to actually generate the functions that we will use:

MERGEREQFCT(scsi_merge_requests_fn_,   0, 0)
MERGEREQFCT(scsi_merge_requests_fn_d,  0, 1)
MERGEREQFCT(scsi_merge_requests_fn_c,  1, 0)
MERGEREQFCT(scsi_merge_requests_fn_dc, 1, 1)

It is quite important that you understand exactly what this is doing. The inline function is never intended to be used in any context other than what I have described here - the 4 macro expansions define 4 functions. Each function calls the inline function just once, and we pass integer constants as the arguments for dma and clustering. Because the arguments are compile-time constants, the compiler can optimize away the code paths that a given variant does not need. By doing it this way, requests headed for PCI drivers are not penalized with having to execute code that deals with ISA DMA issues, yet there is a single source from which the functions are derived.
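For readers who want to try the technique outside the kernel, here is a self-contained sketch of the same pattern. The body of the general function is invented for illustration (the real contents are elided above), and all demo_ and DEMO_ names are hypothetical. The only outside fact relied on is that ISA DMA can address just the first 16 MB of memory.

```c
/* Hypothetical, simplified request: one buffer per request. */
struct demo_request {
	unsigned long start_addr;	/* physical start of the buffer */
	unsigned long end_addr;		/* one past the end of the buffer */
	int segments;			/* scatter-gather entries needed */
};

#define DEMO_ISA_DMA_LIMIT 0x1000000UL	/* 16 MB */

/*
 * The single general function. use_clustering and dma_host are always
 * passed as constants, so each variant below can have the tests it does
 * not need folded away by the compiler.
 */
static inline int demo_merge_generic(const struct demo_request *req,
				     const struct demo_request *next,
				     int max_segments,
				     int use_clustering,
				     int dma_host)
{
	int segments = req->segments + next->segments;

	if (use_clustering && req->end_addr == next->start_addr) {
		/* Adjacent buffers can share one scatter-gather entry,
		 * unless an ISA host could not reach the combined buffer. */
		if (!dma_host || next->end_addr <= DEMO_ISA_DMA_LIMIT)
			segments--;
	}
	return segments <= max_segments;
}

/* Generates one thin, specialized wrapper per combination of flags. */
#define DEMO_MERGEFCT(FUNCTION, CLUSTER, DMA)				\
static int FUNCTION(const struct demo_request *req,			\
		    const struct demo_request *next,			\
		    int max_segments)					\
{									\
	return demo_merge_generic(req, next, max_segments,		\
				  CLUSTER, DMA);			\
}

DEMO_MERGEFCT(demo_merge_fn_,   0, 0)
DEMO_MERGEFCT(demo_merge_fn_d,  0, 1)
DEMO_MERGEFCT(demo_merge_fn_c,  1, 0)
DEMO_MERGEFCT(demo_merge_fn_dc, 1, 1)

/* Exercises the variants with a one-entry scatter-gather table. */
static int demo_variants(void)
{
	struct demo_request lo_a = { 0x1000UL, 0x2000UL, 1 };
	struct demo_request lo_b = { 0x2000UL, 0x3000UL, 1 };
	struct demo_request hi_a = { 0xFFF000UL, 0x1000000UL, 1 };
	struct demo_request hi_b = { 0x1000000UL, 0x1001000UL, 1 };

	return demo_merge_fn_d(&lo_a, &lo_b, 1) * 10000	/* no clustering */
	     + demo_merge_fn_(&lo_a, &lo_b, 1) * 1000	/* no clustering */
	     + demo_merge_fn_c(&lo_a, &lo_b, 1) * 100	/* clusters */
	     + demo_merge_fn_dc(&lo_a, &lo_b, 1) * 10	/* below 16 MB */
	     + demo_merge_fn_dc(&hi_a, &hi_b, 1);	/* crosses 16 MB */
}
```

Compiling this with optimization enabled and inspecting the generated assembly for demo_merge_fn_ should show whether the clustering and DMA tests were eliminated; that depends on the compiler and flags used, so it is worth checking rather than assuming.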

I have a small program which "strips" out the old code (effectively keying off of CONFIG_SCSI_NEW_QUEUES). This could be handy in the event that someone were trying to understand the new code, and was getting confused by the presence of all of the old code. If people request it, I can include a link to this program.

I did all of the design and initial testing in a user-space simulator. This is essentially a user-space program that can be run under gdb, and it is possible to inject requests (at the block_read layer) which are ultimately processed by the scsi_debug host adapter. It is possible to gain insight as to code paths and control flow by stepping in the debugger. If there is interest, I can post the sources, and a precompiled executable. This thing is a real pain in the neck to maintain (getting the thing to link, and faking memory management and task scheduling seems to be kind of fragile as it breaks with every kernel release).

I should point out that no modifications to low-level drivers will be required to use the new queuing code. Ultimately there are a few cleanups which would require low-level modifications, but those issues remain for a future date.

Finally, there is one last important point. The new queuing code is optional in the sense that you must enable it when you configure the kernel. If you do not do this, the old queue handling code will be used instead. There are a couple of reasons for doing this - the main one is that I have no illusions that I will get all of this right the first time around. Thus if people encounter problems they can simply switch back to the old code while we figure out what to do about it. Once things are quite stable, the old queuing code will be removed from the kernel, but until this point people should not get stuck.

 

Current status

1/3/00

The changes were incorporated into the distribution kernels as of linux-2.3.32. A number of problems have been corrected since then which are not available here. Note that it is no longer possible to select/deselect the new queuing code during kernel configuration. It was decided that it complicated things too much to have this choice, and that it would be better to bite the bullet and work through the issues rather than have nobody test the thing.

 

If you have comments or suggestions, you can email me.

I do not read linux-kernel. All discussions, bug reports, etc, should either be reported directly to me, or to linux-scsi.

Last updated: 1/3/00.