3. Upper layer
3.0 Introduction

The upper layer of the SCSI subsystem takes requests that come from outside the SCSI subsystem and turns them into actual SCSI requests. These requests are in turn passed down to the middle layer. Once command processing is complete, the upper layer receives the status from the middle layer and notifies the external layer of that status.

Requests originate from three different sources. For block devices, requests originate from the ll_rw_blk layer. For character devices, the requests effectively originate directly from the filesystem code as users attempt to operate on the devices. The third source is ioctl. To a large degree this is similar to how character device requests originate, but ioctls can also be issued to block devices.

There is near universal agreement that the current state of command queuing is inadequate and needs to be completely redone. In addition, changes will be needed in the ll_rw_blk layer. The design for the rewrite is not yet complete, but some of the ideas are discussed in the Future directions section. No matter what happens, the current state of the upper layer is essentially valid for all of the 2.0-2.2 series kernels.

While the specific details vary a little, the basic tasks of the upper layer are:
3.1 Command queuing

Queuing a command is a relatively simple operation. We get a pointer to a

3.2 Block devices

3.2.1 ll_rw_blk

To understand how block device requests are handled, you must first understand the ll_rw_blk layer. The ll_rw_blk layer has the job of accumulating I/O requests that originate from the filesystems themselves. To optimize I/O performance, the ll_rw_blk layer has a number of properties which are of interest to us:
Many of these properties don't suit us well. To work around these problems, a couple of poor design decisions (i.e. hacks) were made.
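One of the behaviours covered in section 3.2.2 below, biting off only as much of a request as the scatter-gather table can hold, can be illustrated outside the kernel. The following is a minimal user-space sketch, not the code from the disk or CDROM drivers: `struct buf`, `sg_prefix_that_fits()`, and the rule that buffers adjacent in memory share one segment are all assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a kernel buffer: one contiguous chunk. */
struct buf {
    unsigned long addr;   /* start address of the chunk */
    unsigned long len;    /* length of the chunk in bytes */
};

/*
 * Return how many leading buffers of bufs[0..n) can be described with at
 * most max_segments scatter-gather segments. A buffer that starts exactly
 * where the previous one ends is assumed to share that buffer's segment.
 * If the return value is less than n, the request would have to be split
 * at that index, with the remainder left on the queue.
 */
static size_t sg_prefix_that_fits(const struct buf *bufs, size_t n,
                                  unsigned max_segments)
{
    unsigned segments = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        int contiguous = i > 0 &&
            bufs[i - 1].addr + bufs[i - 1].len == bufs[i].addr;

        if (!contiguous) {
            if (segments == max_segments)
                break;          /* this buffer would need one segment too many */
            segments++;
        }
    }
    return i;
}
```

With five buffers of which only the first two are adjacent in memory and a limit of two segments, the function returns 3: the first three buffers would be issued now, and the remaining two would stay on the queue as a separate request.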
3.2.2 do_sd_request/do_sr_request

The SCSI upper layer receives a call from the ll_rw_blk layer. No specific request is attached to this call; it is a general suggestion that there might be something in the queue which can be started. Thus the first thing the SCSI upper layer must do is examine the request queue and see whether the request at the head can be queued immediately. If it is queueable, we allocate a Scsi_Cmd structure to be used for the request and copy the request into its request field. At the same time, we completely release the original request block so that the ll_rw_blk layer doesn't need to know which requests in the list might already be active.

There is one important caveat here, however. At the time we set up the command block, we look to see how many scatter-gather segments would be required to queue the entire command. If this number exceeds the maximum scatter-gather tablesize, then instead of completely removing the request from the queue, we split it in two so that we bite off just as much as we can chew at the moment.

Once we have the

No matter what, the

The decision of whether a command is queueable largely depends upon the scsi_allocate_device() or scsi_request_queueable() functions.

3.2.3 requeue_sd_request/requeue_sr_request

This is the function that actually calculates the physical sectors to which the request belongs. Attempts to read past the end of the disk or to access offline devices should be rejected at this point. The same goes for attempts to access a device whose media has been changed: we won't allow access until the partition tables have been re-read.

The next major task of this function is to make sure that the buffers are suitable for I/O. This involves generating the scatter-gather table, if required, and allocating bounce buffers for host adapters that do DMA over an ISA bus.

Finally, the actual SCSI command is generated, and the
mid-layer scsi_do_cmd function is called via the

3.2.4 rw_intr

The post-processing has a number of jobs. If there were no errors, it is just a matter of deallocating any bounce buffers and the memory for the scatter-gather table. Once this has taken place, the status of the buffers that belong to the request must be updated. Once this is done, the queuing side is called again to see if a new command can now be started.

If an error was detected, we attempt to find out how much of the request actually succeeded. Any part that succeeded is handled in the appropriate way by marking its buffers uptodate. The buffers representing blocks after this point are marked not uptodate and are then unlocked.

3.3 Character devices

Character device drivers tend to be quite a bit simpler than the block device drivers. The major difference is that there are different types of entrypoints. For a tape drive, for example, we have separate entrypoints for read and for write. In addition, the requests always originate from a user process rather than from some intermediate buffer cache or filesystem.

The general approach is fairly simple. We start by requesting
a pointer to a

The interrupt handler could, if it wanted to, examine the sense data to see if the command completed normally. I should point out that this task could just as easily be done by the process that queued the request, and doing it that way would reduce the interrupt latency somewhat.

One problem with the current generation of SCSI character device drivers is that they tend to allocate large buffers so as to be guaranteed a DMA-safe location from which data can be transferred. While this does work, it increases the kernel memory footprint. None of the character device drivers attempt to use scatter-gather to reduce the need for the large buffer. In fact, it is entirely possible that for some requests the I/O could be done directly into the pages of user memory which correspond to the buffer. At the time of this writing, an effort is being made to fit the generics driver with the ability to use scatter-gather instead of the "big" buffer that it currently uses.

3.4 Ioctl

There isn't a whole lot about an ioctl that is different from the way that requests are handled for character devices. The difference is that an ioctl on a file descriptor might be requesting any one of the following:
This is implemented by having each level (in the order specified above) examine the ioctl command code, process the command if it recognizes it, and pass it down to the next level if it does not. Other than this, an ioctl is handled in much the same way as a character device request.

3.5 Disk

There is one feature that makes a disk drive different from a
cdrom or any other type of device. Disks tend to have partition tables. This
means that at bootup time we must hook into the data structures that

In addition, when a media change is detected, we must disallow
access to the device until the

There is also a module-related complication: when a device is
removed from the system, the

3.6 CDROM

A CDROM has an entirely different set of problems associated with it. CDROM drives tend to be slow, which isn't really our concern here. If the user is using a changer type of device, they can be *really* slow, as the media needs to be flipped in and out of the drive itself.

Most of the differences relate to the fact that a CDROM has a lot of additional ioctl functions which can be used for purposes such as playing audio. In addition, CDROMs have a table of contents (newer ones do, anyway), which is how multi-session discs are supported. Some drives have the ability to read/write audio data directly. Finally, some drives are also CDROM writers, which can burn a disc as well as read it.

In fact, many of these oddball capabilities don't directly have anything to do with the CDROM drive itself. For example, burning a disc with a CD writer tends to involve using the generics interface rather than the cdrom interface.

The table of contents does enter into the picture for the CDROM driver, however. The reason is that the iso9660 filesystem must be able to find the volume descriptor for the last session on the disc, as this is the one which contains the pointer to the root directory that we need to use. The support for reading the table of contents internally uses
the ioctl interface, but the filesystem gains access to this through an entry in
the

Some of the ioctl functions that the CDROM driver supports are vendor specific. This is because earlier drafts of the standard didn't provide this functionality, but newer drafts do. Thus there isn't likely to be a code explosion as more and more new drives come on the market.

3.7 Tape

In theory, the tape driver should be simple: convert a user request into a SCSI command, issue the command, wait for the result, and then return. In theory. In practice, it takes a few thousand lines of code to accomplish it all. Much of the reason is that there are tons of ioctls related to setting things like blocksize, density, and compression. There are also ioctls which skip the tape forward and backward, count filemarks, and so forth.

Beyond this, I don't know that much about how the tape driver works. If more information is needed here, perhaps Kai Makisara can fill it in.

3.8 Generics

The generics driver is yet another case of something which was badly done in the original version and then patched like crazy to make it somewhat workable. At the moment, there are plans to rewrite portions of it in the Linux-2.3 timeframe. Until then, it is going to be incrementally patched to fix some of the remaining problems.

Other than this, the generics driver is another classic character mode driver. The difference is that the user specifies the SCSI command that is to be executed, and the user gets the error status back again.

3.9 Future directions

As you can see from some of the discussion above, a lot about command queuing (especially for block devices) leaves something to be desired. What we have now seems to be fairly stable, but it is very hard to maintain, and I believe that there are performance bottlenecks introduced by some of the design decisions in the current implementation.
Here are some of the rough ideas for what we would like for the new version. At the moment, it seems likely that these changes will be available in the early Linux-2.3 timeframe. A lot of this is just preliminary ideas, so don't take any of this as set in stone.
If you have comments or suggestions, you can email me.

Last updated: 4/99.