SCSI-3

 

 

3. Upper layer

 

3.0 Introduction

The upper layer of the SCSI subsystem has the job of taking requests that come from outside of the SCSI subsystem, and turning them into actual SCSI requests. The requests are in turn passed down to the middle layer. Once the command processing is complete, the upper layer receives the status from the middle layer, and in turn the upper layer will notify the external layer of the status.

Requests originate from 3 different sources. For block devices, requests originate from the ll_rw_blk layer. For character devices, the requests effectively originate directly from the filesystem code as users attempt to operate on the devices. Finally, the third source of requests is via ioctl - to a large degree this is similar to how character device requests originate, although ioctls can also be issued to block devices.

There is near universal agreement that the current state of the command queuing is inadequate, and it needs to be completely redone. In addition, changes will be needed in the ll_rw_blk layer. The design is not yet complete for the rewrite, but some of the ideas will be discussed in the Future directions section. No matter what happens, the current state of the upper layer is essentially valid for all of the 2.0-2.2 series kernels.

While the specific details vary a little bit, the basic tasks of the upper layer are:
Translate incoming requests into SCSI commands (e.g. READ_10).
Create scatter-gather lists for the request.
Allocate bounce buffers if the low-level host requires them.
Track usage counts as file descriptors are opened and closed.
Maintain externally visible arrays for device size and block size.
Finally, there is some amount of common glue that is required to make it possible for an upper level driver to be a module.
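
To make the first task concrete, here is a minimal sketch of filling in a 10-byte READ(10) command block. The opcode value and byte layout follow the standard SCSI definition of READ(10); the helper name build_read10 is hypothetical and is not taken from the kernel source.

```c
#include <stdint.h>
#include <string.h>

#define READ_10 0x28  /* SCSI opcode for READ(10) */

/* Hypothetical helper: fill a 10-byte command block that reads nblocks
 * logical blocks starting at lba. Multi-byte fields are big-endian. */
void build_read10(uint8_t cdb[10], uint32_t lba, uint16_t nblocks)
{
    memset(cdb, 0, 10);
    cdb[0] = READ_10;
    cdb[2] = (uint8_t)(lba >> 24);     /* logical block address, MSB first */
    cdb[3] = (uint8_t)(lba >> 16);
    cdb[4] = (uint8_t)(lba >> 8);
    cdb[5] = (uint8_t)lba;
    cdb[7] = (uint8_t)(nblocks >> 8);  /* transfer length in blocks */
    cdb[8] = (uint8_t)nblocks;
}
```

A real upper level driver would derive lba and nblocks from the incoming request and the device block size before handing the command block to scsi_do_cmd.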

3.1 Command queuing

Queuing a command is a relatively simple operation. We get a pointer to a Scsi_Cmnd object, fill in a command block with the SCSI command that we want to perform, and then call scsi_do_cmd to queue the request. Once this happens, we merely need to wait for an interrupt to indicate that the request is done. The devil is in the details.

3.2 Block devices

3.2.1 ll_rw_blk

To understand how block device requests are handled, it is first required that you understand the ll_rw_blk layer. To begin with, the ll_rw_blk layer has the job of accumulating I/O requests that originate from filesystems themselves. In order to optimize I/O performance, the ll_rw_blk layer has a number of properties which are of interest to us:
It merges requests for adjacent disk blocks into larger blocks.
It also uses an elevator sorting algorithm to try to arrange requests in order of sector number. The assumption here is that seek times scale with the number of tracks that the disk must move the heads across, so ordering requests by sector number should minimize seek overhead.
The ll_rw_blk layer makes one big assumption - when merging two smaller requests into one larger request, it should never let a request grow to be more than about 200 sectors.
The ll_rw_blk layer assumes that the request at the head of the list is in the process of being handled by the driver itself, so that it will never attempt to merge a new I/O request into the request at the head of the list.
The ll_rw_blk layer assumes that any request past the head of the list is inactive (in that only one request can be queued at one time).
The ll_rw_blk layer attempts to queue a command after each request is inserted into the request list, despite the fact that it is known that more requests (which are probably adjacent) still have not been merged into the request lists.
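
The merging and sorting properties listed above can be sketched in a few lines of user-space C. This is a simplified model, not the ll_rw_blk code: the structure, the function name add_request and the exact value of MAX_SECTORS are assumptions made for illustration, and only back-merging is shown.

```c
#include <stddef.h>

#define MAX_SECTORS 200   /* assumed value for the "about 200 sectors" cap */

struct req {
    unsigned long sector;     /* first sector of the request */
    unsigned long nr_sectors; /* length in sectors */
    struct req *next;
};

/* Insert rq into a list kept sorted by sector number. The head entry is
 * treated as active and is never merged into. Returns 1 if rq was merged
 * into an existing request, 0 if it was linked into the list. */
int add_request(struct req **head, struct req *rq)
{
    struct req *cur;

    if (*head == NULL) {
        rq->next = NULL;
        *head = rq;
        return 0;
    }
    /* Only entries after the head are candidates for merging. */
    for (cur = *head; cur->next != NULL; cur = cur->next) {
        struct req *n = cur->next;

        /* Back-merge: rq starts exactly where n ends, and the result
         * stays within the size cap. */
        if (n->sector + n->nr_sectors == rq->sector &&
            n->nr_sectors + rq->nr_sectors <= MAX_SECTORS) {
            n->nr_sectors += rq->nr_sectors;
            return 1;
        }
        /* Elevator order: stop before the first larger sector. */
        if (n->sector > rq->sector)
            break;
    }
    rq->next = cur->next;
    cur->next = rq;
    return 0;
}
```

Note how a request adjacent to the head is linked in as a separate entry rather than merged, which is exactly the behavior the second hack described below was added to change for SCSI.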

Many of these properties don't suit us well. To work around these problems, a couple of poor design decisions (i.e. hacks) were made.
To begin with, a "plug" hack was added. This was an attempt to solve the last point above, by preventing any requests from being passed down to the upper SCSI layer until all of the incoming blocks have been merged into existing requests, or added to requests of their own.
Secondly, a hack was added to the ll_rw_blk layer whereby new requests can be merged into the request at the head of the list for some major numbers (e.g. SCSI), and not others (e.g. floppy).

3.2.2 do_sd_request/do_sr_request

The SCSI upper layer receives a call from the ll_rw_blk layer - there isn't a specific request attached to this, but it is a general suggestion that there might be something in the queue which can be started. Thus the first thing the SCSI upper layer must do is examine the request queue and see if the request at the head can be queued immediately. If it is queueable, then we allocate a Scsi_Cmnd structure which will be used for the request, and copy the request into its request field. At the same time, we completely release the original request block so that the ll_rw_blk layer doesn't need to know which requests in the list might already be active.

There is one important caveat here, however. At the time we set up the command block, we look to see how many scatter-gather segments would be required to queue the entire command. If this number exceeds the maximum scatter-gather tablesize, then instead of completely removing the request from the queue, we instead split it in two so that we bite off just as much as we can chew at the moment.
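
The split decision can be illustrated with a small sketch. It assumes, for illustration only, that buffers which are contiguous in memory share one scatter-gather segment; the structure and the name buffers_that_fit are hypothetical and do not come from sd.c.

```c
#include <stddef.h>
#include <stdint.h>

struct buf {
    uintptr_t addr; /* start address of the buffer */
    size_t    len;  /* length in bytes */
};

/* Return how many leading buffers of a request can be queued without
 * needing more than sg_tablesize scatter-gather segments. Buffers that
 * are contiguous in memory are counted as one segment. */
size_t buffers_that_fit(const struct buf *bufs, size_t n, size_t sg_tablesize)
{
    size_t segments = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        int contiguous = i > 0 &&
            bufs[i - 1].addr + bufs[i - 1].len == bufs[i].addr;

        if (!contiguous) {
            if (segments == sg_tablesize)
                break;      /* table is full: split the request here */
            segments++;
        }
    }
    return i;
}
```

If the return value is smaller than n, the first part of the request is queued now and the remainder stays on the queue as a smaller request.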

Once we have the Scsi_Cmnd pointer (SCpnt), it gets passed to requeue_sd_request.

No matter what, the do_sd_request function will keep looping back to search for more commands that can be queued. The point of this operation is to keep as many devices as possible active at the same time.

The decision of whether a command is queueable or not largely depends upon the scsi_allocate_device() or scsi_request_queueable() functions.

3.2.3 requeue_sd_request/requeue_sr_request

This is the function that actually calculates the physical sectors which the request belongs to. Attempts to read past the end of the disk or attempts to access offline devices should be rejected at this point.

The same goes for attempts to access a device for which the media has been changed - we won't allow access until the partition tables have been re-read.
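
A rough sketch of these checks follows. The structures and the name map_request are invented for the example; the real driver keeps this state in its own per-disk and per-partition tables.

```c
struct part {
    unsigned long start_sect; /* first physical sector of the partition */
    unsigned long nr_sects;   /* size of the partition in sectors */
};

struct disk_state {
    int online;        /* 0 if the device has been taken offline */
    int media_changed; /* 1 until the partition table is re-read */
};

/* Translate a partition-relative request into a physical sector.
 * Returns 0 and stores the result in *out, or -1 if the request must
 * be rejected. */
int map_request(const struct disk_state *d, const struct part *p,
                unsigned long sector, unsigned long nr, unsigned long *out)
{
    if (!d->online || d->media_changed)
        return -1;
    /* Written this way so the comparison cannot overflow. */
    if (nr > p->nr_sects || sector > p->nr_sects - nr)
        return -1;
    *out = p->start_sect + sector;
    return 0;
}
```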

The next major task of this function is to make sure that the buffers are suitable for I/O. This will involve generating the scatter-gather table, if required, and allocating bounce buffers for host adapters that do DMA over an ISA bus.
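
The bounce buffer decision comes down to an address check, since ISA DMA can only reach the first 16 MB of physical memory. The sketch below shows that check in isolation; the function name needs_bounce and its parameters are assumptions for the example, not the driver's actual interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Highest physical address reachable with 24-bit ISA DMA. */
#define ISA_DMA_LIMIT 0x00ffffffUL

/* Return 1 if a buffer must be copied through a bounce buffer located
 * below 16 MB before the host adapter can transfer it, 0 otherwise. */
int needs_bounce(int host_uses_isa_dma, uint32_t phys_addr, size_t len)
{
    if (!host_uses_isa_dma || len == 0)
        return 0;
    /* Test the last byte of the buffer, not just the first. */
    return (uint64_t)phys_addr + len - 1 > ISA_DMA_LIMIT;
}
```

For a write, the data is copied into the bounce buffer before the command is issued; for a read, it is copied back out during post-processing.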

Finally, the actual SCSI command is generated, and the mid-layer scsi_do_cmd function is called, which will pass the command down to the host adapter. The upper layer will not handle this command again until the interrupt handler calls back up to the rw_intr function for post-processing.

3.2.4 rw_intr

The post processing has a number of jobs. In the event that there were no errors, then it is just a matter of deallocating any bounce buffers, and deallocating memory for the scatter-gather table. Once this has taken place, the actual buffer status for the buffers that belong to the request must be updated.

Once this is done, then the queuing side is called again to see if a new command can now be started.

In the event that an error was detected, we attempt to find out how much of the request actually succeeded. Any part that has succeeded is handled in the appropriate way by marking buffers uptodate.

The buffers representing blocks after this point are marked not uptodate and then are unlocked.
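
The buffer bookkeeping for a partially successful request can be modelled as below. This is a simplified sketch: the structure stands in for the kernel's buffer heads, and the name finish_buffers is hypothetical.

```c
#include <stddef.h>

struct buffer {
    int uptodate; /* 1 if the buffer holds valid data */
    int locked;   /* 1 while I/O on the buffer is outstanding */
};

/* Complete a request whose first `good` buffers transferred correctly.
 * Those are marked uptodate, the remainder are marked not uptodate,
 * and every buffer is unlocked. */
void finish_buffers(struct buffer *bufs, size_t n, size_t good)
{
    size_t i;

    for (i = 0; i < n; i++) {
        bufs[i].uptodate = (i < good);
        bufs[i].locked = 0;
    }
}
```

With good equal to n this also covers the error-free case described at the start of this section.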

3.3 Character devices

Character devices tend to be quite a bit simpler than the block device drivers. The major difference is that there are different types of entrypoints. For a tape drive, we have a separate entrypoint for read and for write, for example. In addition, the requests always originate from a user process rather than from some intermediate buffer cache or filesystem.

The general approach is fairly simple. We start by requesting a pointer to a Scsi_Cmnd object with the scsi_allocate_device function, call scsi_do_cmd and then block waiting for an interrupt. The interrupt handler has the job of waking up the process that queued the request.

The interrupt handler could, if it wanted to, examine the sense data to see if the command completed normally. I should point out that this task could just as easily be done by the process that queued the request, and by doing it this way we could reduce the interrupt latency somewhat.
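
The division of labour described above can be modelled in user space. Everything in this sketch is a stand-in: mock_do_cmd and mock_interrupt play the roles of scsi_do_cmd and the host adapter interrupt, and a flag replaces the real sleep and wake-up. It only illustrates that the completion callback can do nothing more than wake the sleeper, leaving the status check to the process.

```c
#include <string.h>

struct cmnd {
    unsigned char cdb[10];          /* command block */
    int result;                     /* status filled in at completion */
    int done;                       /* set when the sleeper may continue */
    void (*done_fn)(struct cmnd *); /* completion callback */
};

/* Stand-in for scsi_do_cmd: record the command and its callback. */
void mock_do_cmd(struct cmnd *c, const unsigned char *cdb, size_t len,
                 void (*done_fn)(struct cmnd *))
{
    memset(c, 0, sizeof(*c));
    memcpy(c->cdb, cdb, len < sizeof(c->cdb) ? len : sizeof(c->cdb));
    c->done_fn = done_fn;
}

/* Completion callback, run in "interrupt" context: only wake the sleeper. */
void wake_sleeper(struct cmnd *c)
{
    c->done = 1;
}

/* Stand-in for the host adapter finishing the command. */
void mock_interrupt(struct cmnd *c, int result)
{
    c->result = result;
    c->done_fn(c);
}

/* Process context: after waking up, interpret the status here. */
int check_result(const struct cmnd *c)
{
    return (c->done && c->result == 0) ? 0 : -1;
}
```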

One problem with the current generation of SCSI character device drivers is that they tend to allocate large buffers so as to be guaranteed a DMA-safe location from which data can be transferred. While this does work, it increases the kernel memory footprint. None of the character device drivers attempt to use scatter-gather to reduce the need for the large buffer - in fact, it is entirely possible that for some requests the I/O can be done directly into the pages of user memory which correspond to the buffer.

At the time of this writing, an effort is being made to fit the generics driver with the ability to use scatter-gather instead of the "big" buffer that it currently uses.

3.4 Ioctl

There isn't a whole lot about an ioctl that is different from the way that requests are handled for character devices. The difference is that an ioctl on a file descriptor might be requesting any one of the following:
A device-specific operation of some sort.
A device-independent type of operation (that wouldn't depend upon the device being a disk, tape, etc).
A special operation that requests information about the host itself.

This is implemented by having each level (in the order specified above) examine the ioctl command code, process the command if it is something that it recognizes, and pass it down to the next level if the command is something that is not recognized. Other than this, ioctl tends to be done just about the same as a character device request.
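
A minimal sketch of this chaining is shown below. The command codes, return values and function names are all made up for the example; the point is only that each level handles what it recognizes and hands everything else to the next level down.

```c
#include <errno.h>

/* Hypothetical command codes, one recognized at each level. */
enum { CMD_DISK_GEOMETRY = 1, CMD_GET_IDLUN = 2, CMD_HOST_INFO = 3 };

/* Last level: operations on the host adapter itself. */
int host_ioctl(int cmd)
{
    switch (cmd) {
    case CMD_HOST_INFO:
        return 3;
    default:
        return -EINVAL; /* no level recognized the command */
    }
}

/* Middle level: device-independent operations. */
int generic_ioctl(int cmd)
{
    switch (cmd) {
    case CMD_GET_IDLUN:
        return 2;
    default:
        return host_ioctl(cmd);
    }
}

/* First level: device-specific operations (disk, tape, ...). */
int device_ioctl(int cmd)
{
    switch (cmd) {
    case CMD_DISK_GEOMETRY:
        return 1;
    default:
        return generic_ioctl(cmd);
    }
}
```

Here the return value simply identifies which level handled the command; a real handler would perform the operation and return its status.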

3.5 Disk

There is one feature that makes a disk drive different from a cdrom or any other type of device. Disks tend to have partition tables. This means that at bootup time we must hook into the data structures that genhd uses to inform it that a new device exists which may have a partition table.

In addition, when a media change is detected, we must disallow access to the device until the fop_revalidate entrypoint is called - this will typically happen as a result of the disk being remounted. Note that we do attempt to lock the door for drives with removable media, so cases where the media is changed while the disk is still mounted are rare. More often than not the situation is that the disk is unmounted, the media is changed, and then mounted again. During the mount procedure, the media change is detected, the buffers flushed, and the new partition table is read. The revalidation code lives in revalidate_scsidisk.

There is also a module related complication - when a device is removed from the system, the genhd datastructures must be cleaned up, and if the disk driver itself is unloaded, the sd_gendisk structure itself must be removed from the gendisk linked list. This is all handled by the cleanup_module entrypoint.

3.6 CDROM

A CDROM has an entirely different set of problems associated with it. They tend to be slow, which isn't really our concern here. If the user is using a changer type of device, they can be *really* slow as the media needs to be flipped in and out of the drive itself.

Most of the differences relate to the fact that a CDROM has a lot of additional IOCTL functions which can be used for purposes such as playing audio. In addition, CDROMs have a table of contents (newer ones do, anyways), which is how multi-session discs are supported. Some drives have the ability to read/write audio data directly. Finally, some drives are also CDROM writers, which have the ability to burn a disc as well as reading it.

In fact many of these oddball capabilities don't directly have anything to do with the CDROM drive itself. For example, burning a disc with a cd writer tends to involve using the generics interface rather than the cdrom interface.

The table of contents does enter into the picture for the CDROM driver, however. The reason for this is that the iso9660 filesystem must have the ability to find the volume descriptor for the last session on the disc, as this is the one which contains the pointer to the root directory that we need to use.

The support for reading the table of contents internally uses the ioctl interface, but the filesystem gains access to this through an entry in the cdrom_device_ops table.

Some of the ioctl functions that the CDROM driver supports are vendor specific. This is due to the fact that earlier drafts of the standard didn't provide this functionality, but newer drafts do. Thus there isn't likely to be a code explosion as more and more new drives come on the market.

3.7 Tape

In theory, the tape driver should be simple. Convert a user request into a SCSI command, issue the command, wait for the result and then return. In theory. In practice, it takes a few thousand lines of code to accomplish it all.

Much of the reason for this is that there are tons of ioctls related to setting things like blocksize, density, and compression. Also, there are ioctls which skip the tape forward and backward, count filemarks, and so forth.

Beyond this, I don't know that much about how the tape driver works. If more information is needed here, perhaps Kai Makisara can fill it in.

3.8 Generics

The generics driver is yet another case of something which was badly done in the original version, and then patched like crazy to make it somewhat workable. At the moment, there are plans to rewrite portions of the thing in the Linux-2.3 timeframe. Until that time, the thing is going to be incrementally patched to fix some of the remaining problems.

Other than this, the generics driver is another classic character mode driver. The difference is that the user specifies the SCSI command that is to be executed, and the user gets the error status back again.

3.9 Future directions

As you can see from some of the discussion above, a lot about command queuing (especially for block devices) leaves something to be desired. What we have now seems to be fairly stable, but it is very hard to maintain, and I believe that there are performance bottlenecks introduced by some of the design decisions in the current implementation.

Here are some of the rough ideas for what we would like for the new version - at the moment, it seems likely that these changes will be available in the early Linux-2.3 timeframe. A lot of this is just preliminary ideas, so don't take any of this as set in stone.
To begin with, the ll_rw_blk layer needs to be redone. The final design will have some of the following properties:
There will be multiple queues per major number. An API function (one per major) will be used which can quickly look up a pointer to the appropriate queue.
Each queue will have its own function for queue insertion, which will decide if a new request can be merged into any existing requests. The goal is to prevent the need for splitting requests later on - each request in the queue should be small enough that it can be queued in one shot.
Each queue will have its own request_fn function, instead of there being one per major number.
On the SCSI end of things, each Scsi_Host object will have its own queue, queue insertion routine, and request_fn.
Default values for the request_fn will be supplied - the default will be chosen by examining the settings in the Scsi_Host and Scsi_Host_Template objects (such as the scatter-gather size, whether the host does ISA DMA, etc). By supplying a handful of request_fns that can be used, I believe it will be possible to continue to use all of the existing drivers with little or no changes. Note that driver authors that wish to will be able to write custom versions of the request_fn for the queue.
Character device and Ioctl requests will still be generated in a way similar to the way they are done now - the requests will still be passed to scsi_do_cmd or something like it. The difference is that scsi_do_cmd will turn it into a request that can be inserted into the regular queue, and the request can be started via request_fn.
Much of the guts of the block device queuing and the middle layer will be turned around so that the request_fn can act as the driver, and the middle layer will act as passenger.
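
Since the design is not final, the following is only a guess at what a per-queue structure along these lines might look like. Every name in it (request_queue, insert_fn, max_sectors, default_insert, default_request_fn) is hypothetical. The sketch shows the central idea: the insertion routine enforces the queue's limits at merge time, so the request_fn can start whatever it finds at the head without splitting it.

```c
#include <stddef.h>

struct request {
    unsigned long sector;
    unsigned long nr_sectors;
    struct request *next;
};

struct request_queue {
    struct request *head;
    unsigned long max_sectors; /* per-queue limit, e.g. from host settings */
    /* Merges rq into an existing request or links it in; 1 if merged. */
    int (*insert_fn)(struct request_queue *q, struct request *rq);
    /* Takes the next request to start off the queue, or returns NULL. */
    struct request *(*request_fn)(struct request_queue *q);
};

/* Default insertion: merge into the tail only if the combined request
 * still fits within the queue's limit, otherwise append. */
int default_insert(struct request_queue *q, struct request *rq)
{
    struct request **pp = &q->head;
    struct request *tail = NULL;

    while (*pp != NULL) {
        tail = *pp;
        pp = &tail->next;
    }
    if (tail != NULL &&
        tail->sector + tail->nr_sectors == rq->sector &&
        tail->nr_sectors + rq->nr_sectors <= q->max_sectors) {
        tail->nr_sectors += rq->nr_sectors;
        return 1;
    }
    rq->next = NULL;
    *pp = rq;
    return 0;
}

/* Default request_fn: every queued request already fits in one shot. */
struct request *default_request_fn(struct request_queue *q)
{
    struct request *rq = q->head;

    if (rq != NULL)
        q->head = rq->next;
    return rq;
}
```

A host with different constraints would supply its own insert_fn, for example one that counts scatter-gather segments instead of sectors.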


Last updated: 4/99.