A rewrite of the error handling code is currently underway. To see why a rewrite is a good idea, it helps to understand the shortcomings of the existing code.
Many of these shortcomings have been known for a considerable time. The problem has always been that you cannot write a replacement until you have a better idea of how to do something. It turned out that pieces of ideas had been floating about among a number of Linux developers, and at the Linux Expo in Raleigh, NC in April a number of us came together and put together a rough sketch of what a reasonable replacement would be. Later email discussions with other Linux developers enhanced and refined these ideas. The patches on this site are a work in progress that implements the ideas we came up with.
There are other areas of the SCSI system which need attention. At the moment the error handling is the highest priority, since this is where things tend to get screwed up most often. The additional areas that will undergo work in the near future are:
There are lots of ideas floating around about exactly what sorts of changes make sense - as we get closer to actually implementing them, the details will be filled in here.
Although we are currently working only on the error handling code, the patches are rather large, because we are trying to accomplish a fairly large number of things at the same time. Here is a summary of some of the new features. A few will be discussed in more detail below.
The logging facility will make it much easier to troubleshoot problems in the field. The basic idea is that debugging output can be enabled at any time, and the resulting messages are placed in the kernel log. There are a number of different facilities within the kernel, each of which can generate messages at varying levels of verbosity. When you have a specific problem in one area, you can turn on highly verbose logging for that subsystem and leave the rest off, or set some of the others to lower levels of verbosity.
Each facility has a name. The valid names are: all, none, timeout, scan, mlqueue, mlcomplete, llqueue, llcomplete, hlqueue, hlcomplete, and ioctl. Each facility can have a verbosity level from 0 to 7, where 0 is completely quiet and 7 is maximum verbosity.
There are two fundamental ways of turning this on. One is via the LILO command line:
where N is a number that represents the logging mask. There isn't a really good symbolic way of describing what you are turning on from the kernel command line - you are essentially specifying a 32-bit quantity that has all of the logging levels stored in it. The easiest way to find the number you actually want is to set the desired logging level through the /proc filesystem and then record the logging mask that is reported.
The other way to turn on logging is via the /proc filesystem. A shell command like:
will set the verbosity level to 5 for events related to timeout handling. If you are lazy and unsure where the problem is, you can set all of the facilities to the maximum verbosity level with the following command:
Similarly, it is possible to turn off all logging with the following command:
Note - there is a kernel configuration option that can be used to enable/disable logging.
The presence of the logging code shouldn't add any noticeable run-time overhead, as the test for whether a given facility is at a sufficiently high logging level is done inline, without any function calls. The only noticeable effect of having logging enabled is that the kernel will be slightly larger, as a number of printk() and other statements need to be present.
All of the host templates have been reformatted to make it easier to add new functions that might only be used by a few hosts. The fundamental point is that GCC allows structures to be initialized with the following syntax:
Syntactically, you can specify the initializer for each element individually by name. Any elements that aren't specified are initialized to 0. The advantage of this type of initializer is that a new function pointer can be added to the structure definition for the special-purpose use of one particular driver, and none of the other drivers need source modifications to fill in a new 'NULL' placeholder in the initializer.
There are a number of relatively minor changes that will be required to take advantage of the new error handling code. The first and most important is a boolean field in the host template that indicates whether the driver uses the old or the new error handling code. To illustrate this, I will give an example of one host template that contains all of the required changes:
#endif
The boolean that controls whether the old or new error handling code is used is the use_new_eh_code field. There are several other new fields: eh_abort_handler, eh_device_reset_handler, eh_bus_reset_handler, and eh_host_reset_handler. Each of these is used by the mid-level code to request a specific action.
Each of the eh_*_handler functions takes a single argument, the SCpnt for the command in question (from which you can obtain the host structure, if need be), and each returns either SUCCESS or FAILED. No other return codes are allowed. In addition, the mid-level code will not run a timer to ensure that the function returns within a specific period of time. If the operation could potentially get stuck, it is the responsibility of the low-level routine to add a timer to unblock and clean up. You are not required to add a timer, of course - only if it is possible for the operation to get stuck and never return.
You are not required to implement all of the eh_*_handler interfaces. Any that are unspecified are treated as if they returned FAILED, and the error handler thread will go on to the next level.
In case it isn't clear from the above discussion, here are the prototypes of the new function pointers:
Finally, the queuecommand interface must correctly return either TRUE or FALSE. TRUE indicates that the command was successfully queued to the host adapter; FALSE indicates that for some reason the command couldn't be queued, and that the mid-level should try again later. Note - the current patches don't yet support this in the mid-layer, but this is the eventual goal. The main reason for this change is to improve compatibility with BSD-derived drivers.
Note - this is a work in progress. There will be bugs, but I have tried a number of different host adapters myself to catch as many as I can.
Note2 - The bulk of the patches were added to the 2.1.75 kernel. Currently only the 1542 is using the new error handling code - all other drivers continue to use the old error handling code until we gain a bit more confidence in the new code.
Download kernel patches (2.1.41)
Download kernel patches (2.1.59)
Download kernel patches (2.1.71)
Patches already applied in 2.1.75 kernel.
If you have comments or suggestions, you can email me.
Last updated: 1/7/98.