Re: PATCH: Enhance queueing/scsi-midlayer to handle kiobufs. [Re: Request splits]

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* Re: PATCH: Enhance queueing/scsi-midlayer to handle kiobufs. [Re: Request splits]
       [not found] ` <200005181955.MAA71492@getafix.engr.sgi.com>
@ 2000-05-19 15:09   ` Stephen C. Tweedie
  2000-05-19 15:48     ` Brian Pomerantz
                       ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Stephen C. Tweedie @ 2000-05-19 15:09 UTC (permalink / raw)
  To: Chaitanya Tumuluri
  Cc: Eric Youngdale, sct, Alan Cox, Douglas Gilbert, Brian Pomerantz,
	linux-scsi, chait, linux-mm

Hi,

On Thu, May 18, 2000 at 12:55:04PM -0700, Chaitanya Tumuluri wrote:

> I've had the same question in my mind. I've also wondered why raw I/O was
> restricted to only KIO_MAX_SECTORS at a time.

Mainly for resource limiting --- you don't want to have too much user
memory pinned permanently at once.

The real solution is probably not to increase the atomic I/O size, but
rather to pipeline I/Os.  That is planned for the future, and now there
are other people interested in it, I'll bump that work up the queue a 
bit!  The idea is to allow brw_kiovec to support fully async operation,
and for the raw device driver to work with multiple kiobufs.  That way
we can keep 2 or 3 kiobufs streaming at all times, eliminating the 
stalls between raw I/O segments, without having to increase the max
segment size.

In the future we'll also need to charge these I/Os against a per-user
limit on pinned memory to control resources.  We can't really offer
O_DIRECT I/O to unprivileged user processes until we have eliminated
that possible DOS attack.  Right now raw devices can only be created
by root and are protected by normal filesystem modes, so we don't
have too much of a problem.

> So, I enhanced Stephen Tweedie's
> raw I/O and the queueing/scsi layers to handle kiobufs-based requests. This is
> in addition to the current buffer_head based request processing.

The "current" kiobuf code is in ftp.uk.linux.org:/pub/linux/sct/fs/raw-io/.
It includes a number of bug fixes (mainly rationalising the error returns),
plus a few new significant bits of functionality.  If you can get me a 
patch against those diffs, I'll include your new code in the main kiobuf
patchset.  (I'm still maintaining the different kiobuf patches as
separate patches within that patchset tarball.)

> Thus, ll_rw_blk.c has two new functions: 
> 	o ll_rw_kio()
> 	o __make_kio_request()

Oh, *thankyou*.  This has been needed for a while.

> Here's the patch against a 2.3.99-pre2 kernel. To recap, two primary reasons
> for this patch:
> 	1. To enhance the queueing and scsi-mid layers to handle kiobuf-based 
> 	   requests as well,
> 
> 	2. Remove request size limits on the upper layers (above ll_rw_blk.c). 
> 	   The KIO_MAX_SECTORS seems to have been inspired by MAX_SECTORS 
> 	   (128 per request) in ll_rw_blk.c. The scsi mid-layer should handle 
> 	   `oversize' requests based on the HBA sg_tablesize.
> 
> I'm not too sure about 2. above; so I'd love to hear from more knowledgeable
> people on that score.

It shouldn't be too much of a problem to retain this limit if the 
brw_kiovec code can stream properly.

> I'd highly appreciate any feedback before I submit this patch `officially'.

There needs to be some mechanism for dealing with drivers which do
not have kiobuf request handling implemented.

I also think that the code is too dependent on request->buffer.  We
_really_ need to treat this as an opportunity to eliminate that field
entirely for kiobuf-based I/Os.  kiobufs refer to struct page *s, not
individual data pointers, and so they can easily represent pages above
the 4GB limit on large memory machines using PAE36.  If we want to be
able to add dual-address-cycle or PCI64 support to the individual scsi
drivers at all, then we need to be able to preserve addresses above
4GB, and kiobufs would seem to be a sensible way to do this if we're
going to have them in the struct request at all.

--Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: PATCH: Enhance queueing/scsi-midlayer to handle kiobufs. [Re: Request splits]
  2000-05-19 15:09   ` PATCH: Enhance queueing/scsi-midlayer to handle kiobufs. [Re: Request splits] Stephen C. Tweedie
@ 2000-05-19 15:48     ` Brian Pomerantz
  2000-05-19 15:55       ` Stephen C. Tweedie
  2000-05-19 17:38     ` Chaitanya Tumuluri
  2000-05-23 21:58     ` Chaitanya Tumuluri
  2 siblings, 1 reply; 12+ messages in thread
From: Brian Pomerantz @ 2000-05-19 15:48 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Chaitanya Tumuluri, Eric Youngdale, Alan Cox, Douglas Gilbert,
	linux-scsi, chait, linux-mm

On Fri, May 19, 2000 at 04:09:58PM +0100, Stephen C. Tweedie wrote:
> Hi,
> 
> On Thu, May 18, 2000 at 12:55:04PM -0700, Chaitanya Tumuluri wrote:
>  
> > I've had the same question in my mind. I've also wondered why raw I/O was
> > restricted to only KIO_MAX_SECTORS at a time.
> 
> Mainly for resource limiting --- you don't want to have too much user
> memory pinned permanently at once.
> 
> The real solution is probably not to increase the atomic I/O size, but
> rather to pipeline I/Os.  That is planned for the future, and now there

That really depends on the device characteristics.  This Ciprico
hardware I've been working with really only performs well if the
atomic I/O size is >= 1MB.  Once you introduce additional transactions
across the bus, your performance drops significantly.  I guess it is a
tradeoff between latency and bandwidth.  Unless you mean the low level
device would be handed a vector of kiobufs and it would build a single
SCSI request with that vector, then I suppose it would work well but
the requests would have to make up a contiguous chunk of drive space.


BAPper
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: PATCH: Enhance queueing/scsi-midlayer to handle kiobufs. [Re: Request splits]
  2000-05-19 15:48     ` Brian Pomerantz
@ 2000-05-19 15:55       ` Stephen C. Tweedie
  2000-05-19 16:17         ` Brian Pomerantz
  2000-05-19 17:53         ` PATCH: Enhance queueing/scsi-midlayer to handle kiobufs. [Re: Request splits] Chaitanya Tumuluri
  0 siblings, 2 replies; 12+ messages in thread
From: Stephen C. Tweedie @ 2000-05-19 15:55 UTC (permalink / raw)
  To: Stephen C. Tweedie, Chaitanya Tumuluri, Eric Youngdale, Alan Cox,
	Douglas Gilbert, linux-scsi, chait, linux-mm

Hi,

On Fri, May 19, 2000 at 08:48:42AM -0700, Brian Pomerantz wrote:

> > The real solution is probably not to increase the atomic I/O size, but
> > rather to pipeline I/Os.  That is planned for the future, and now there
> 
> That really depends on the device characteristics.  This Ciprico
> hardware I've been working with really only performs well if the
> atomic I/O size is >= 1MB.  Once you introduce additional transactions
> across the bus, your performance drops significantly.  I guess it is a
> tradeoff between latency and bandwidth.  Unless you mean the low level
> device would be handed a vector of kiobufs and it would build a single
> SCSI request with that vector,

ll_rw_block can already do that, but...

> then I suppose it would work well but
> the requests would have to make up a contiguous chunk of drive space.

... a single request _must_, by definition, be contiguous.  There is
simply no way for the kernel to deal with non-contiguous atomic I/Os.
I'm not sure what you're talking about here --- how can an atomic I/O
be anything else?  We can do scatter-gather, but only from scattered
memory, not to scattered disk blocks.

--Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: PATCH: Enhance queueing/scsi-midlayer to handle kiobufs. [Re: Request splits]
  2000-05-19 15:55       ` Stephen C. Tweedie
@ 2000-05-19 16:17         ` Brian Pomerantz
  2000-05-19 18:00           ` Chaitanya Tumuluri
  2000-05-19 18:11           ` Gérard Roudier
  2000-05-19 17:53         ` PATCH: Enhance queueing/scsi-midlayer to handle kiobufs. [Re: Request splits] Chaitanya Tumuluri
  1 sibling, 2 replies; 12+ messages in thread
From: Brian Pomerantz @ 2000-05-19 16:17 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Chaitanya Tumuluri, Eric Youngdale, Alan Cox, Douglas Gilbert,
	linux-scsi, chait, linux-mm

On Fri, May 19, 2000 at 04:55:02PM +0100, Stephen C. Tweedie wrote:
> Hi,
> 
> On Fri, May 19, 2000 at 08:48:42AM -0700, Brian Pomerantz wrote:
> 
> > > The real solution is probably not to increase the atomic I/O size, but
> > > rather to pipeline I/Os.  That is planned for the future, and now there
> > 
> > That really depends on the device characteristics.  This Ciprico
> > hardware I've been working with really only performs well if the
> > atomic I/O size is >= 1MB.  Once you introduce additional transactions
> > across the bus, your performance drops significantly.  I guess it is a
> > tradeoff between latency and bandwidth.  Unless you mean the low level
> > device would be handed a vector of kiobufs and it would build a single
> > SCSI request with that vector,
> 
> ll_rw_block can already do that, but...
> 
> > then I suppose it would work well but
> > the requests would have to make up a contiguous chunk of drive space.
> 
> ... a single request _must_, by definition, be contiguous.  There is
> simply no way for the kernel to deal with non-contiguous atomic I/Os.
> I'm not sure what you're talking about here --- how can an atomic I/O
> be anything else?  We can do scatter-gather, but only from scattered
> memory, not to scattered disk blocks.
> 

I may just be confused about how this whole thing works still.  I had
to go change the number of SG segments the QLogic driver allocates and
reports to the SCSI middle layer to a larger number otherwise the
transaction gets split up and I no longer have a single 1MB
transaction but four 256KB transactions.  The number of segments it
was set to was 32 (8KB * 32 = 256KB).  So the question I have is in
the end when you do this pipelining, if you don't increase the atomic
I/O size, will the device attached to the SCSI bus (or FC) still
receive a single request or will it quickly see a bunch of smaller
requests?  My point is, from my experiments with this RAID device, you
will run across situations where it is good to be able to make a
single SCSI request be quite large in order to achieve better
performance.


BAPper
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: PATCH: Enhance queueing/scsi-midlayer to handle kiobufs. [Re: Request splits]
  2000-05-19 15:09   ` PATCH: Enhance queueing/scsi-midlayer to handle kiobufs. [Re: Request splits] Stephen C. Tweedie
  2000-05-19 15:48     ` Brian Pomerantz
@ 2000-05-19 17:38     ` Chaitanya Tumuluri
  2000-05-23 21:58     ` Chaitanya Tumuluri
  2 siblings, 0 replies; 12+ messages in thread
From: Chaitanya Tumuluri @ 2000-05-19 17:38 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Eric Youngdale, Alan Cox, Douglas Gilbert, Brian Pomerantz,
	linux-scsi, chait, linux-mm

On Fri, 19 May 2000 16:09:58 BST, "Stephen C. Tweedie" <sct@redhat.com> wrote:
>Hi,
>
>On Thu, May 18, 2000 at 12:55:04PM -0700, Chaitanya Tumuluri wrote:
> 
>> I've had the same question in my mind. I've also wondered why raw I/O was
>> restricted to only KIO_MAX_SECTORS at a time.
>
>Mainly for resource limiting --- you don't want to have too much user
>memory pinned permanently at once.

The flip side of this argument being that raw I/O is "raw" access to a
device and such access is usually needed by systems (e.g. databases) 
that know what they are doing. I would like to think that in the raw I/O
path at least, we shouldn't be posting such limits. Besides, things like
databases have their "buffer caches" allocated during startup and pinned
till the next reboot.

>The real solution is probably not to increase the atomic I/O size, but
>rather to pipeline I/Os.  That is planned for the future, and now there
>are other people interested in it, I'll bump that work up the queue a 
>bit!  The idea is to allow brw_kiovec to support fully async operation,
>and for the raw device driver to work with multiple kiobufs.  That way
>we can keep 2 or 3 kiobufs streaming at all times, eliminating the 
>stalls between raw I/O segments, without having to increase the max
>segment size.

Sounds good; and its pretty much what I have in mind. And that is why
I've added the following field to the kiobuf struct, to allow the issue
of multiple kiobuf requests from the `parent' kiovec. This would be useful
in the completion functions for the individual kiobufs:

+#if CONFIG_KIOBUF_IO
+       void *k_dev_id;                 /* Store kiovec (or pagebuf) here */
+#endif

However, I'd still like to draw the focus away from the limits imposed at 
the raw I/O (i.e raw.c / buffer.c) layers and say that we should work with 
the real h/w limits. The HBA scatter-gather memory size is the limiting 
factor here. There are two ways of dealing with I/O requests larger than 
this limit:

	1. Repeatedly queue scsi-cmnds against the HBA/device till all
	   the I/O is done. This requeuing is done at the scsi midlayer.

	2. Provide for a "continuation" field in the Scsi_Cmnd struct
	   that low-level HBA drivers understand and use to re-issue
	   the I/Os in chunks of the sg_tablesize. 

The advantage of 2. is obvious in that it doesn't pump the system with
completion interrupts. However, the easiest solution at this point is
1. above (given the mechanism already exists in the scsi midlayers).

>In the future we'll also need to charge these I/Os against a per-user
>limit on pinned memory to control resources.  We can't really offer
>O_DIRECT I/O to unprivileged user processes until we have eliminated
>that possible DOS attack.  Right now raw devices can only be created
>by root and are protected by normal filesystem modes, so we don't
>have too much of a problem.

I hadn't thought that far; but I can see your point, yes.

>> So, I enhanced Stephen Tweedie's
>> raw I/O and the queueing/scsi layers to handle kiobufs-based requests. This is
>> in addition to the current buffer_head based request processing.
>
>The "current" kiobuf code is in ftp.uk.linux.org:/pub/linux/sct/fs/raw-io/.
>It includes a number of bug fixes (mainly rationalising the error returns),
>plus a few new significant bits of functionality.  If you can get me a 
>patch against those diffs, I'll include your new code in the main kiobuf
>patchset.  (I'm still maintaining the different kiobuf patches as
>separate patches within that patchset tarball.)

Great...I'll work on it shortly. 

>> Thus, ll_rw_blk.c has two new functions: 
>> 	o ll_rw_kio()
>> 	o __make_kio_request()
>
>Oh, *thankyou*.  This has been needed for a while.

Now there's a real gratifying response for you! Thank _you_! :^)
Will you consider including these changes also (i.e. queueing/scsi midlayers)
in your patchset as well?

>> Here's the patch against a 2.3.99-pre2 kernel. To recap, two primary reasons
>> for this patch:
>> 	1. To enhance the queueing and scsi-mid layers to handle kiobuf-based 
>> 	   requests as well,
>> 
>> 	2. Remove request size limits on the upper layers (above ll_rw_blk.c). 
>> 	   The KIO_MAX_SECTORS seems to have been inspired by MAX_SECTORS 
>> 	   (128 per request) in ll_rw_blk.c. The scsi mid-layer should handle 
>> 	   `oversize' requests based on the HBA sg_tablesize.
>> 
>> I'm not too sure about 2. above; so I'd love to hear from more knowledgeable
>> people on that score.
>
>It shouldn't be too much of a problem to retain this limit if the 
>brw_kiovec code can stream properly.
>
>> I'd highly appreciate any feedback before I submit this patch `officially'.
>
>There needs to be some mechanism for dealing with drivers which do
>not have kiobuf request handling implemented.

They will continue working with the current buffer_head path. That is why
as far as possible, I've separated the buffer_head and kiobuf request
handling into separate functions in the code. This is also the reason I
still have the #ifdefs in the code.....it enables easier surgery when the
time comes (if and when!) to remove the buffer_head I/O paths completely.

<plug>
I've experimented with XFS performance using kiobuf-based requests and it
has shown performance improvements ... data is still not stable yet. The
main improvements (as expected) is in lowered CPU overheads and slightly
improved disk thro'puts.

It'd be great if we could sit down and convert ext2 to use kiobufs/kiovecs
and see the difference.
<\plug>

>I also think that the code is too dependent on request->buffer.  We
>_really_ need to treat this as an opportunity to eliminate that field
>entirely for kiobuf-based I/Os.  kiobufs refer to struct page *s, not
>individual data pointers, and so they can easily represent pages above
>the 4GB limit on large memory machines using PAE36.  If we want to be
>able to add dual-address-cycle or PCI64 support to the individual scsi
>drivers at all, then we need to be able to preserve addresses above
>4GB, and kiobufs would seem to be a sensible way to do this if we're
>going to have them in the struct request at all.

True and thats been the reasoning behind the "pagebuf" efforts currently
being used in the XFS work. 

I'll download your rawio source and merge my changes into that source.

Cheers,
-Chait.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: PATCH: Enhance queueing/scsi-midlayer to handle kiobufs. [Re: Request splits]
  2000-05-19 15:55       ` Stephen C. Tweedie
  2000-05-19 16:17         ` Brian Pomerantz
@ 2000-05-19 17:53         ` Chaitanya Tumuluri
  1 sibling, 0 replies; 12+ messages in thread
From: Chaitanya Tumuluri @ 2000-05-19 17:53 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Eric Youngdale, Alan Cox, Douglas Gilbert, linux-scsi, chait, linux-mm

On Fri, 19 May 2000 16:55:02 BST, "Stephen C. Tweedie" <sct@redhat.com> wrote:
>Hi,
>
>On Fri, May 19, 2000 at 08:48:42AM -0700, Brian Pomerantz wrote:
>
>> > The real solution is probably not to increase the atomic I/O size, but
>> > rather to pipeline I/Os.  That is planned for the future, and now there
>> 
>> That really depends on the device characteristics.  This Ciprico
>> hardware I've been working with really only performs well if the
>> atomic I/O size is >= 1MB.  Once you introduce additional transactions
>> across the bus, your performance drops significantly.  I guess it is a
>> tradeoff between latency and bandwidth.  Unless you mean the low level
>> device would be handed a vector of kiobufs and it would build a single
>> SCSI request with that vector,

Hmm...I was thinking more along the lines of kiobuf abstraction being limited
to the scsi midlayer and the low-level device (HBA/disk driver) being handed
a linked list of Scsi_Cmnds, each containing at most the HBA sg_tablesize of
I/O. The chaining of such Scsi_Cmnd structs is not possible currently and 
might be the way to go. Each Scsi_Cmnd in the chain would represent one 
kiobuf-based I/O request at a time. 

>ll_rw_block can already do that, but...
>
>> then I suppose it would work well but
>> the requests would have to make up a contiguous chunk of drive space.
>
>... a single request _must_, by definition, be contiguous.  There is
>simply no way for the kernel to deal with non-contiguous atomic I/Os.
>I'm not sure what you're talking about here --- how can an atomic I/O
>be anything else?  We can do scatter-gather, but only from scattered
>memory, not to scattered disk blocks.

And that could potentially be handled via the linked list of Scsi_Cmnd 
structs that I mention above. Each I/O within a Scsi_Cmnd would be restricted 
to contiguous disk blocks but that needn't apply across the linked list.

Cheers,
-Chait.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: PATCH: Enhance queueing/scsi-midlayer to handle kiobufs. [Re: Request splits]
  2000-05-19 16:17         ` Brian Pomerantz
@ 2000-05-19 18:00           ` Chaitanya Tumuluri
  2000-05-19 18:11           ` Gérard Roudier
  1 sibling, 0 replies; 12+ messages in thread
From: Chaitanya Tumuluri @ 2000-05-19 18:00 UTC (permalink / raw)
  To: Brian Pomerantz
  Cc: Stephen C. Tweedie, Eric Youngdale, Alan Cox, Douglas Gilbert,
	linux-scsi, chait, linux-mm

On Fri, 19 May 2000 09:17:18 PDT, Brian Pomerantz <bapper@piratehaven.org> wrote:
>
>		< stuff snipped >
>
>was set to was 32 (8KB * 32 = 256KB).  So the question I have is in
>the end when you do this pipelining, if you don't increase the atomic
>I/O size, will the device attached to the SCSI bus (or FC) still
>receive a single request or will it quickly see a bunch of smaller
>requests?  My point is, from my experiments with this RAID device, you
>will run across situations where it is good to be able to make a
>single SCSI request be quite large in order to achieve better
>performance.

Agreed. And the patch I've suggested to this list does exactly that. It
allows you to issue large I/Os and the scsi midlayers will take care of
the device sg_tablesize limitations and split/re-issue the large I/O 
into smaller sg_tablesize I/Os till the entire request is done. 

So, the limitation (at least in the rawio path) would only be the HBA 
sg_tablesize. You wouldn't even have to endure the wait in the request
queue, since these multiple sg_tablesize requests would be inserted at
the head of the queue and the dispatch function for the queue called
immediately (i.e. no _undue_ plugging/unplugging of the device queues).

Cheers,
-Chait.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: PATCH: Enhance queueing/scsi-midlayer to handle kiobufs. [Re: Request splits]
  2000-05-19 16:17         ` Brian Pomerantz
  2000-05-19 18:00           ` Chaitanya Tumuluri
@ 2000-05-19 18:11           ` Gérard Roudier
  2000-05-19 19:24             ` Brian Pomerantz
  1 sibling, 1 reply; 12+ messages in thread
From: Gérard Roudier @ 2000-05-19 18:11 UTC (permalink / raw)
  To: Brian Pomerantz
  Cc: Stephen C. Tweedie, Chaitanya Tumuluri, Eric Youngdale, Alan Cox,
	Douglas Gilbert, linux-scsi, chait, linux-mm

On Fri, 19 May 2000, Brian Pomerantz wrote:

> On Fri, May 19, 2000 at 04:55:02PM +0100, Stephen C. Tweedie wrote:
> > Hi,
> > 
> > On Fri, May 19, 2000 at 08:48:42AM -0700, Brian Pomerantz wrote:
> > 
> > > > The real solution is probably not to increase the atomic I/O size, but
> > > > rather to pipeline I/Os.  That is planned for the future, and now there
> > > 
> > > That really depends on the device characteristics.  This Ciprico
> > > hardware I've been working with really only performs well if the
> > > atomic I/O size is >= 1MB.  Once you introduce additional transactions

Hmmm... SCSI allows up to 30,000 transactions per second and 15,000 T/s is
observed with current technology. This allows to be comfortable with
Ultra-320 even with using not too large transactions.

This let me claim that this 'Ciprico' should be damned shitty design or
implementation of a SCSI device.

Using very large scatterlists may complexify a lot SCSI sub-system and
drivers or let them oger memory for their memory pool. This Ciprico does
not deserve that we add penalty and bloatage in our software, in my
opinion.  The only reasonnable approach could be to use some peripheral
driver that can try to allocate a huge mostly contiguous chunk of memory
for the Ciprico, but to leave quiet our software that does fit nicely the
needs of reasonnably designed and implemented SCSI devices.

> > > across the bus, your performance drops significantly.  I guess it is a
> > > tradeoff between latency and bandwidth.  Unless you mean the low level
> > > device would be handed a vector of kiobufs and it would build a single
> > > SCSI request with that vector,
> > 
> > ll_rw_block can already do that, but...
> > 
> > > then I suppose it would work well but
> > > the requests would have to make up a contiguous chunk of drive space.
> > 
> > ... a single request _must_, by definition, be contiguous.  There is
> > simply no way for the kernel to deal with non-contiguous atomic I/Os.
> > I'm not sure what you're talking about here --- how can an atomic I/O
> > be anything else?  We can do scatter-gather, but only from scattered
> > memory, not to scattered disk blocks.
> > 
> 
> I may just be confused about how this whole thing works still.  I had
> to go change the number of SG segments the QLogic driver allocates and
> reports to the SCSI middle layer to a larger number otherwise the
> transaction gets split up and I no longer have a single 1MB
> transaction but four 256KB transactions.  The number of segments it
> was set to was 32 (8KB * 32 = 256KB).  So the question I have is in
> the end when you do this pipelining, if you don't increase the atomic
> I/O size, will the device attached to the SCSI bus (or FC) still
> receive a single request or will it quickly see a bunch of smaller
> requests?  My point is, from my experiments with this RAID device, you
> will run across situations where it is good to be able to make a
> single SCSI request be quite large in order to achieve better
> performance.

Low-level drivers have limits on number of scatter entries. They can do
large transfers if scatter entries point to large data area. Rather than
hacking low-level drivers that are very critical piece of code that
require specific knowledge and documentation about the hardware, I
recommend you to hack the peripheral driver used for the Ciprico and let
it use large contiguous buffers (If obviously you want to spend your time
for this device that should go to compost, IMO).

Wanting to provide best support for shitty designed hardware does not
encourage hardware vendors to provide us with well designed hardware. In
others words, the more we want to support crap, the more we will have to
support crap.

Gerard.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: PATCH: Enhance queueing/scsi-midlayer to handle kiobufs. [Re: Request splits]
  2000-05-19 18:11           ` Gérard Roudier
@ 2000-05-19 19:24             ` Brian Pomerantz
  2000-05-19 20:43               ` Gérard Roudier
  0 siblings, 1 reply; 12+ messages in thread
From: Brian Pomerantz @ 2000-05-19 19:24 UTC (permalink / raw)
  To: Gérard Roudier
  Cc: Stephen C. Tweedie, Chaitanya Tumuluri, Eric Youngdale, Alan Cox,
	Douglas Gilbert, linux-scsi, chait, linux-mm

On Fri, May 19, 2000 at 08:11:10PM +0200, Gerard Roudier wrote:
> 
> Low-level drivers have limits on number of scatter entries. They can do
> large transfers if scatter entries point to large data area. Rather than
> hacking low-level drivers that are very critical piece of code that
> require specific knowledge and documentation about the hardware, I
> recommend you to hack the peripheral driver used for the Ciprico and let
> it use large contiguous buffers (If obviously you want to spend your time
> for this device that should go to compost, IMO).
> 
> Wanting to provide best support for shitty designed hardware does not
> encourage hardware vendors to provide us with well designed hardware. In
> others words, the more we want to support crap, the more we will have to
> support crap.
> 

I really don't want to get into a pissing match but it is obvious to
me that you haven't had any experience with high performance external
RAID solutions.  You will see this sort of performance characteristic
with a lot of these devices.  They are used to put together very large
storage systems (we are looking at building petabtye systems within
two years).

There is no way I'm going to hack on the proprietary RAID controller
in this system and even if I wanted to or could, I'm quite certain
there is a reason for needing the large transaction size.  When you
take into account the maximum transaction unit on each drive (usually
64KB), the fact that there are 8 data drives, then you have parity
calculation, latency of the transfer, and various points at which data
is cached and queued up before there is a complete transaction, then
you come up with the magic number.  These things were designed for
large streaming data and they do it VERY well.

If you have a hardware RAID 3 or RAID 5 solution that will give me
this kind of performance for the price point and size that these
Ciprico units have, then I would LOVE to hear it because I'm in the
market for buying several of them.  If you can find me a way of
getting >= 150MB/s streaming I/O on a single I/O server for my cluster
and fill an order for 2 I/O servers for under $100K, then I may
consider something other than the Ciprico 7000.  They deliver this
performance for a very attractive price.  And in a year, I'll come
back and buy a fifty more.

BAPper
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: PATCH: Enhance queueing/scsi-midlayer to handle kiobufs. [Re: Request splits]
  2000-05-19 19:24             ` Brian Pomerantz
@ 2000-05-19 20:43               ` Gérard Roudier
  2000-05-20  9:10                 ` Change direct I/O memory model? [Was Re: PATCH: Enhance queueing/scsi-midlayer to handle kiobufs] Mark Mokryn
  0 siblings, 1 reply; 12+ messages in thread
From: Gérard Roudier @ 2000-05-19 20:43 UTC (permalink / raw)
  To: Brian Pomerantz
  Cc: Stephen C. Tweedie, Chaitanya Tumuluri, Eric Youngdale, Alan Cox,
	Douglas Gilbert, linux-scsi, chait, linux-mm

On Fri, 19 May 2000, Brian Pomerantz wrote:

> On Fri, May 19, 2000 at 08:11:10PM +0200, Gerard Roudier wrote:
> > 
> > Low-level drivers have limits on number of scatter entries. They can do
> > large transfers if scatter entries point to large data area. Rather than
> > hacking low-level drivers that are very critical piece of code that
> > require specific knowledge and documentation about the hardware, I
> > recommend you to hack the peripheral driver used for the Ciprico and let
> > it use large contiguous buffers (If obviously you want to spend your time
> > for this device that should go to compost, IMO).
> > 
> > Wanting to provide best support for shitty designed hardware does not
> > encourage hardware vendors to provide us with well designed hardware. In
> > others words, the more we want to support crap, the more we will have to
> > support crap.
> > 
> 
> I really don't want to get into a pissing match but it is obvious to
> me that you haven't had any experience with high performance external
> RAID solutions.  You will see this sort of performance characteristic
> with a lot of these devices.  They are used to put together very large
> storage systems (we are looking at building petabtye systems within
> two years).
> 
> There is no way I'm going to hack on the proprietary RAID controller
> in this system and even if I wanted to or could, I'm quite certain
> there is a reason for needing the large transaction size.  When you
> take into account the maximum transaction unit on each drive (usually
> 64KB), the fact that there are 8 data drives, then you have parity
> calculation, latency of the transfer, and various points at which data
> is cached and queued up before there is a complete transaction, then
> you come up with the magic number.  These things were designed for
> large streaming data and they do it VERY well.
> 
> If you have a hardware RAID 3 or RAID 5 solution that will give me
> this kind of performance for the price point and size that these
> Ciprico units have, then I would LOVE to hear it because I'm in the
> market for buying several of them.  If you can find me a way of
> getting >= 150MB/s streaming I/O on a single I/O server for my cluster
> and fill an order for 2 I/O servers for under $100K, then I may
> consider something other than the Ciprico 7000.  They deliver this
> performance for a very attractive price.  And in a year, I'll come
> back and buy a fifty more.

The SCSI BUS is transaction based and shared. I donnot care of affordable
stuff that moves needless burden to other parts. If they need specific
support for their hardware they must pay for that, and in this situation
these products would probably not really be so affordable.

Note that the same pathology happens to PCI technology. Some PCI devices,
notably Video boards, IDE controllers, brigdes, Network boards, have
abused a LOT of this BUS too. These hardware are/were also probably for an
interesting price but made shit in that place too.

The 36 bit adress extension from Intel is of the same idiomania that
costed a lot for a pathetic result, it seems. It adds complexity to VM
handling on Intel 32 bit systems when 64 bit have been proven to be
feasible 10 years ago and works fine.

I am only interested in technical issues but, in my opinion, it is the
ones that induce costs that must pay for these costs and not others. A
device that requires special handling because it abuses of technologies it
uses should be discarded unless vendor want to pay for the additionnal
effort that such offending stuff requires. But, if such vendors have
interested customers that are ready to pay for the effort or to spend time
implementing specific stuff for these products, I donnot see any problems.

A low-latency BUS does not require huge transactions in order to allow to
use efficiently its bandwitch. If a device requires so, then it can only
be badly designed. Given you description of the Capricio, it seems that it
is based on some kind of stupid batch mode that looks extremally poor
design to me.

Gerard.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Change direct I/O memory model? [Was Re: PATCH: Enhance queueing/scsi-midlayer to handle kiobufs]
  2000-05-19 20:43               ` Gérard Roudier
@ 2000-05-20  9:10                 ` Mark Mokryn
  0 siblings, 0 replies; 12+ messages in thread
From: Mark Mokryn @ 2000-05-20  9:10 UTC (permalink / raw)
  To: linux-scsi
  Cc: Brian Pomerantz, Stephen C. Tweedie, Chaitanya Tumuluri,
	Eric Youngdale, Alan Cox, Douglas Gilbert, chait, linux-mm

One comment about direct I/O: Applications performing direct I/O usually know
what they're doing...And if they don't, well, it's their problem: that's what
direct I/O is all about. Therefore, perhaps the correct approach would be to
allow these apps to manipulate the memory as they see fit (with the standard
memory manipulation routines - mmap, mlock, whatever).  I don't see a logical
reason for the direct I/O stuff to be intricately tied in with memory management
(i.e. limiting atomic I/O size, etc.). Direct I/O should receive a nicely aligned
physical buffer (or a sg list of such buffers) which has already been locked in
memory by the application, and perform the direct I/O. That's it - very simple.
Note that in this case, the memory in question could be a user buffer, or it may
even be a mapping to PCI memory sitting on another device, thus allowing adapter
to adapter data transfers. Note that in NT, when performing unbuffered I/O, it is
up to the user to lock the memory in question to sector-sized boundaries. This is
the correct way to go...

Mark

Gerard Roudier wrote:

> On Fri, 19 May 2000, Brian Pomerantz wrote:
>
> > On Fri, May 19, 2000 at 08:11:10PM +0200, Gerard Roudier wrote:
> > >
> > > Low-level drivers have limits on number of scatter entries. They can do
> > > large transfers if scatter entries point to large data area. Rather than
> > > hacking low-level drivers that are very critical piece of code that
> > > require specific knowledge and documentation about the hardware, I
> > > recommend you to hack the peripheral driver used for the Ciprico and let
> > > it use large contiguous buffers (If obviously you want to spend your time
> > > for this device that should go to compost, IMO).
> > >
> > > Wanting to provide best support for shitty designed hardware does not
> > > encourage hardware vendors to provide us with well designed hardware. In
> > > others words, the more we want to support crap, the more we will have to
> > > support crap.
> > >
> >
> > I really don't want to get into a pissing match but it is obvious to
> > me that you haven't had any experience with high performance external
> > RAID solutions.  You will see this sort of performance characteristic
> > with a lot of these devices.  They are used to put together very large
> > storage systems (we are looking at building petabtye systems within
> > two years).
> >
> > There is no way I'm going to hack on the proprietary RAID controller
> > in this system and even if I wanted to or could, I'm quite certain
> > there is a reason for needing the large transaction size.  When you
> > take into account the maximum transaction unit on each drive (usually
> > 64KB), the fact that there are 8 data drives, then you have parity
> > calculation, latency of the transfer, and various points at which data
> > is cached and queued up before there is a complete transaction, then
> > you come up with the magic number.  These things were designed for
> > large streaming data and they do it VERY well.
> >
> > If you have a hardware RAID 3 or RAID 5 solution that will give me
> > this kind of performance for the price point and size that these
> > Ciprico units have, then I would LOVE to hear it because I'm in the
> > market for buying several of them.  If you can find me a way of
> > getting >= 150MB/s streaming I/O on a single I/O server for my cluster
> > and fill an order for 2 I/O servers for under $100K, then I may
> > consider something other than the Ciprico 7000.  They deliver this
> > performance for a very attractive price.  And in a year, I'll come
> > back and buy a fifty more.
>
> The SCSI BUS is transaction based and shared. I donnot care of affordable
> stuff that moves needless burden to other parts. If they need specific
> support for their hardware they must pay for that, and in this situation
> these products would probably not really be so affordable.
>
> Note that the same pathology happens to PCI technology. Some PCI devices,
> notably Video boards, IDE controllers, brigdes, Network boards, have
> abused a LOT of this BUS too. These hardware are/were also probably for an
> interesting price but made shit in that place too.
>
> The 36 bit adress extension from Intel is of the same idiomania that
> costed a lot for a pathetic result, it seems. It adds complexity to VM
> handling on Intel 32 bit systems when 64 bit have been proven to be
> feasible 10 years ago and works fine.
>
> I am only interested in technical issues but, in my opinion, it is the
> ones that induce costs that must pay for these costs and not others. A
> device that requires special handling because it abuses of technologies it
> uses should be discarded unless vendor want to pay for the additionnal
> effort that such offending stuff requires. But, if such vendors have
> interested customers that are ready to pay for the effort or to spend time
> implementing specific stuff for these products, I donnot see any problems.
>
> A low-latency BUS does not require huge transactions in order to allow to
> use efficiently its bandwitch. If a device requires so, then it can only
> be badly designed. Given you description of the Capricio, it seems that it
> is based on some kind of stupid batch mode that looks extremally poor
> design to me.
>
> Gerard.
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.rutgers.edu

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: PATCH: Enhance queueing/scsi-midlayer to handle kiobufs. [Re: Request splits]
  2000-05-19 15:09   ` PATCH: Enhance queueing/scsi-midlayer to handle kiobufs. [Re: Request splits] Stephen C. Tweedie
  2000-05-19 15:48     ` Brian Pomerantz
  2000-05-19 17:38     ` Chaitanya Tumuluri
@ 2000-05-23 21:58     ` Chaitanya Tumuluri
  2 siblings, 0 replies; 12+ messages in thread
From: Chaitanya Tumuluri @ 2000-05-23 21:58 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: chait, Eric Youngdale, Alan Cox, Douglas Gilbert,
	Brian Pomerantz, linux-scsi, linux-mm

On Fri, 19 May 2000 16:09:58 BST, "Stephen C. Tweedie" <sct@redhat.com> wrote:
>Hi,
>
>On Thu, May 18, 2000 at 12:55:04PM -0700, Chaitanya Tumuluri wrote:
> 
>		< stuff deleted >
>
>> So, I enhanced Stephen Tweedie's
>> raw I/O and the queueing/scsi layers to handle kiobufs-based requests. This is
>> in addition to the current buffer_head based request processing.
>
>The "current" kiobuf code is in ftp.uk.linux.org:/pub/linux/sct/fs/raw-io/.
>It includes a number of bug fixes (mainly rationalising the error returns),
>plus a few new significant bits of functionality.  If you can get me a 
>patch against those diffs, I'll include your new code in the main kiobuf
>patchset.  (I'm still maintaining the different kiobuf patches as
>separate patches within that patchset tarball.)
>

Stephen and others,

Here's my patch against the 2.3.99.pre9-2 patchset from your site. The main
differences from my earlier post are:
	- removed the #ifdefs around my code as Stephen Tweedie suggested,
	- corrected indentation problems pointed out earlier (Eric/Alan).
Finally, I'd like to repeat that given the consensus about moving away from
buffer-head based I/O in the future, it makes sense for me to retain the 
little bit of code duplication. This is in the interests of easy surgery
when we do remove the buffer-head I/O paths.

While I see decent (upto 10%) improvement in b/w and turnaround time for
I/O to a single disk, the biggest impact is the (almost 40%) reduction
in CPU utilization with the new codepath. These are from simple `lmdd' tests 
timed with /usr/bin/time.

Based on further feedback from this audience, I would like to propose this 
change to Linus at some point as a general scsi mechanism to handle 
kiobuf-based requests.

Thanks much,
-Chait.

----------------------------CUT HERE---------------------------------------

--- pre9.2-sct/drivers/block/ll_rw_blk.c	Tue May 23 14:24:22 2000
+++ pre9.2-sct+mine/drivers/block/ll_rw_blk.c	Tue May 23 14:38:20 2000
@@ -4,6 +4,7 @@
  * Copyright (C) 1991, 1992 Linus Torvalds
  * Copyright (C) 1994,      Karl Keyte: Added support for disk statistics
  * Elevator latency, (C) 2000  Andrea Arcangeli <andrea@suse.de> SuSE
+ * Support for kiobuf-based I/O requests: Chaitanya Tumuluri [chait@sgi.com]
  */
 
 /*
@@ -639,7 +640,8 @@
 			starving = 1;
 		if (latency < 0)
 			continue;
-
+		if (req->kiobuf)
+			continue;
 		if (req->sem)
 			continue;
 		if (req->cmd != rw)
@@ -744,6 +746,7 @@
 	req->nr_hw_segments = 1; /* Always 1 for a new request. */
 	req->buffer = bh->b_data;
 	req->sem = NULL;
+	req->kiobuf = NULL; 
 	req->bh = bh;
 	req->bhtail = bh;
 	req->q = q;
@@ -886,6 +889,311 @@
 	__ll_rw_block(rw, nr, bh, 1);
 }
 
+/*
+ * Function:    __make_kio_request()
+ *
+ * Purpose:     Construct a kiobuf-based request and insert into request queue.
+ *
+ * Arguments:   q	- request queue of device
+ *              rw      - read/write
+ *              kiobuf  - collection of pages 
+ *		dev	- device against which I/O requested
+ *		blocknr - dev block number at which to start I/O
+ *              blksize - units (512B or other) of blocknr
+ *
+ * Lock status: No lock held upon entry.
+ *  
+ * Returns:     Nothing
+ *
+ * Notes:       Requests generated by this function should _NOT_  be merged by
+ *  		the __make_request() (new check for `req->kiobuf')
+ *
+ *		All (relevant) req->Y parameters are expressed in sector size 
+ *		of 512B for kiobuf based I/O. This is assumed in the scsi
+ *		mid-layer as well.
+ */	
+static inline void __make_kio_request(request_queue_t * q,
+				      int rw,
+				      struct kiobuf * kiobuf,
+				      kdev_t dev,
+				      unsigned long blocknr,
+				      size_t blksize)
+{
+	int major = MAJOR(dev);
+	unsigned int sector, count, nr_bytes, total_bytes, nr_seg;
+	struct request * req;
+	int rw_ahead, max_req;
+	unsigned long flags;
+	struct list_head * head = &q->queue_head;
+	size_t curr_offset;
+	int orig_latency;
+	elevator_t * elevator;
+	int correct_size, i, kioind;
+	
+	/*
+	 * Sanity Tests:	
+	 *
+	 * The input arg. `blocknr' is in units of the 
+	 * input arg. `blksize' (inode->i_sb->s_blocksize).
+	 * Convert to 512B unit used in blk_size[] array.
+	 */
+	count = kiobuf->length >> 9; 
+	sector = blocknr * (blksize >> 9); 
+
+	if (blk_size[major]) {
+		unsigned long maxsector = (blk_size[major][MINOR(dev)] << 1) + 1;
+
+		if (maxsector < count || maxsector - count < sector) {
+			if (!blk_size[major][MINOR(dev)]) {
+				kiobuf->errno = -EINVAL;
+				goto end_io;
+			}
+			/* This may well happen - the kernel calls bread()
+			   without checking the size of the device, e.g.,
+			   when mounting a device. */
+			printk(KERN_INFO
+				"attempt to access beyond end of device\n");
+			printk(KERN_INFO "%s: rw=%d, want=%d, limit=%d\n",
+				kdevname(dev), rw,
+			       (sector + count)>>1,
+			       blk_size[major][MINOR(dev)]);
+			kiobuf->errno = -ESPIPE;
+			goto end_io;
+		}
+	}
+	/*
+	 * Allow only basic block size multiples in the
+	 * kiobuf->length. 
+	 */
+	correct_size = BLOCK_SIZE;
+	if (blksize_size[major]) {
+		i = blksize_size[major][MINOR(dev)];
+		if (i)
+			correct_size = i;
+	}
+	if ((kiobuf->length % correct_size) != 0) {
+		printk(KERN_NOTICE "ll_rw_kio: "
+		       "request size [%d] not a multiple of device [%s] block-size [%d]\n",
+		       kiobuf->length,
+		       kdevname(dev),
+		       correct_size);
+		kiobuf->errno = -EINVAL;
+		goto end_io;
+	}
+	rw_ahead = 0;	/* normal case; gets changed below for READA */
+	switch (rw) {
+		case READA:
+			rw_ahead = 1;
+			rw = READ;	/* drop into READ */
+		case READ:
+			kstat.pgpgin++;
+			max_req = NR_REQUEST;	/* reads take precedence */
+			break;
+		case WRITERAW:
+			rw = WRITE;
+			goto do_write;	/* Skip the buffer refile */
+		case WRITE:
+		do_write:
+			/*
+			 * We don't allow the write-requests to fill up the
+			 * queue completely:  we want some room for reads,
+			 * as they take precedence. The last third of the
+			 * requests are only for reads.
+			 */
+			kstat.pgpgout++;
+			max_req = (NR_REQUEST * 2) / 3;
+			break;
+		default:
+			BUG();
+			kiobuf->errno = -EINVAL;
+			goto end_io;
+	}
+
+	/*
+	 * Creation of bounce buffers for data in high memory
+	 * should (is) be handled lower in the food-chain.
+	 * Ccurrently done in scsi_merge.c for scsi disks.
+	 *
+	 * Look for a free request with spinlock held.
+	 * Apart from atomic queue access, it prevents
+	 * another thread that has already queued a kiobuf-request
+	 * into this queue from starting it, till we are done.
+	 */
+	elevator = &q->elevator;
+	orig_latency = elevator_request_latency(elevator, rw);
+	spin_lock_irqsave(&io_request_lock,flags);
+	
+	if (list_empty(head))
+		q->plug_device_fn(q, dev);
+	/*
+	 * The scsi disk and cdrom drivers completely remove the request
+	 * from the queue when they start processing an entry.  For this
+	 * reason it is safe to continue to add links to the top entry
+	 * for those devices.
+	 *
+	 * All other drivers need to jump over the first entry, as that
+	 * entry may be busy being processed and we thus can't change
+	 * it.
+	 */
+	if (q->head_active && !q->plugged)
+		head = head->next;
+
+	/* find an unused request. */
+	req = get_request(max_req, dev);
+
+	/*
+	 * if no request available: if rw_ahead, forget it,
+	 * otherwise try again blocking..
+	 */
+	if (!req) {
+		spin_unlock_irqrestore(&io_request_lock,flags);
+		if (rw_ahead){
+			kiobuf->errno = -EBUSY;
+			goto end_io;
+		}
+		req = __get_request_wait(max_req, dev);
+		spin_lock_irqsave(&io_request_lock,flags);
+
+		/* revalidate elevator */
+		head = &q->queue_head;
+		if (q->head_active && !q->plugged)
+			head = head->next;
+	}
+
+	/* fill up the request-info, and add it to the queue */
+	req->cmd = rw;
+	req->errors = 0;
+	req->sector = sector;
+	req->nr_hw_segments = 1;                /* Always 1 for a new request. */
+	req->nr_sectors = count;		/* Length of kiobuf */
+	req->sem = NULL;
+	req->kiobuf = kiobuf; 
+	req->bh = NULL;       
+	req->bhtail = NULL;   
+	req->q = q;
+	/* Calculate req->buffer */
+	curr_offset = kiobuf->offset;
+	for (kioind=0; kioind<kiobuf->nr_pages; kioind++)
+		if (curr_offset >= PAGE_SIZE)	
+			curr_offset -= PAGE_SIZE;
+		else	
+			break;
+	req->buffer = (char *) page_address(kiobuf->maplist[kioind]) +
+	     curr_offset; 
+
+	/* Calculate current_nr_sectors and # of scatter gather segments needed */
+	total_bytes = kiobuf->length;
+	nr_bytes = (PAGE_SIZE - curr_offset) > total_bytes ?
+	     total_bytes : (PAGE_SIZE - curr_offset);
+	req->current_nr_sectors = nr_bytes >> 9; 
+	
+	for (nr_seg = 1;
+	     kioind<kiobuf->nr_pages && nr_bytes != total_bytes;
+	     kioind++) {
+	     ++nr_seg;
+	     if((nr_bytes + PAGE_SIZE) > total_bytes){
+		  break;
+	     } else {
+		  nr_bytes += PAGE_SIZE;
+	     }	
+	}	
+	req->nr_segments = nr_seg;
+
+	add_request(q, req, head, orig_latency);
+	elevator_account_request(elevator, req);
+
+	spin_unlock_irqrestore(&io_request_lock, flags);
+
+end_io:
+	return;
+}
+
+
+
+/*
+ * Function:    ll_rw_kio()
+ *
+ * Purpose:     Insert kiobuf-based request into request queue.
+ *
+ * Arguments:   rw      - read/write
+ *              kiobuf  - collection of pages
+ *		dev	- device against which I/O requested
+ *		blocknr - dev block number at which to start I/O
+ *              sector  - units (512B or other) of blocknr
+ *              error   - return status
+ *
+ * Lock status: Assumed no lock held upon entry.
+ *		Assumed that the pages in the kiobuf ___ARE LOCKED DOWN___.
+ *
+ * Returns:     Nothing
+ *
+ * Notes:       This function is called from any subsystem using kiovec[]
+ *		collection of kiobufs for I/O (e.g. `pagebufs', raw-io). 
+ *		Relies on "kiobuf" field in the request structure.
+ */	
+void ll_rw_kio(int rw,
+	       struct kiobuf *kiobuf,
+	       kdev_t dev,
+	       unsigned long blocknr,
+	       size_t sector,
+	       int *error)
+{
+	request_queue_t *q;
+	/*
+	 * Only support SCSI disk for now.
+	 * 
+	 * ENOSYS to indicate caller
+	 * should try ll_rw_block()
+	 * for non-SCSI (e.g. IDE) disks
+	 * and for MD requests.
+	 */
+	if (!SCSI_DISK_MAJOR(MAJOR(dev)) ||
+	    (MAJOR(dev) == MD_MAJOR)) {
+		*error = -ENOSYS;
+		goto end_io;
+	}
+	/*
+	 * Sanity checks
+	 */
+	q = blk_get_queue(dev);
+	if (!q) {
+		printk(KERN_ERR
+			"ll_rw_kio: Nnonexistent block-device %s\n",
+			kdevname(dev));
+		*error = -ENODEV;
+		goto end_io;
+	}
+	if ((rw & WRITE) && is_read_only(dev)) {
+		printk(KERN_NOTICE "Can't write to read-only device %s\n",
+		       kdevname(dev));
+		*error = -EPERM;
+		goto end_io;
+	}
+	if (q->make_request_fn) {
+		printk(KERN_ERR
+	"ll_rw_kio: Unexpected device [%s] queueing function encountered\n",
+		kdevname(dev));
+		*error = -ENOSYS;
+		goto end_io;
+	}
+	
+	__make_kio_request(q, rw, kiobuf, dev, blocknr, sector);
+	if (kiobuf->errno != 0) {
+		*error = kiobuf->errno;
+		goto end_io;
+	}
+	
+	return;
+end_io:
+	/*
+	 * We come here only on an error so, just set
+	 * kiobuf->errno and call the completion fn.
+	 */
+	if(kiobuf->errno == 0)
+		kiobuf->errno = *error;
+}
+
+
 #ifdef CONFIG_STRAM_SWAP
 extern int stram_device_init (void);
 #endif
@@ -1079,3 +1387,5 @@
 EXPORT_SYMBOL(blk_queue_pluggable);
 EXPORT_SYMBOL(blk_queue_make_request);
 EXPORT_SYMBOL(generic_make_request);
+EXPORT_SYMBOL(__make_kio_request);
+EXPORT_SYMBOL(ll_rw_kio);
--- pre9.2-sct/drivers/char/raw.c	Tue May 23 14:25:36 2000
+++ pre9.2-sct+mine/drivers/char/raw.c	Mon May 22 19:00:09 2000
@@ -238,6 +238,63 @@
 #define SECTOR_SIZE (1U << SECTOR_BITS)
 #define SECTOR_MASK (SECTOR_SIZE - 1)
 
+/*
+ * IO completion routine for a kiobuf-based request.
+ */
+static void end_kiobuf_io_kiobuf(struct kiobuf *kiobuf)
+{
+	kiobuf->locked = 0;
+	if (atomic_dec_and_test(&kiobuf->io_count))
+		wake_up(&kiobuf->wait_queue);
+}
+
+/*
+ * Send I/O down the ll_rw_kio() path first.
+ * It is assumed that any requisite locking
+ * and unlocking of pages in the kiobuf has
+ * been taken care of by the caller.
+ *
+ * Return 0 if I/O should be retried on buffer_head path.
+ * Return number of transferred bytes if successful.
+ * Return -1 value, if there was an I/O error.
+ */
+static inline int try_kiobuf_io(struct kiobuf *iobuf,
+				int rw,
+				unsigned long blocknr,
+				kdev_t dev,
+				char *buf,
+				size_t sector_size)
+{	
+	int err, retval;
+
+	iobuf->end_io = end_kiobuf_io_kiobuf;
+	iobuf->errno = 0;
+	iobuf->locked = 1;
+	atomic_inc(&iobuf->io_count);
+	err = 0;
+	ll_rw_kio(rw, iobuf, dev, blocknr, sector_size, &err);
+
+	if ( err == 0 ) {
+		kiobuf_wait_for_io(iobuf);
+		if (iobuf->errno == 0) {
+			retval = iobuf->length; /* Success */
+		} else {
+			retval = -1;	        /* I/O error */
+		}
+	} else { 
+		atomic_dec(&iobuf->io_count);
+		if ( err == -ENOSYS ) {
+			retval = 0;             /* Retry the buffer_head path */
+		} else {
+			retval = -1;            /* I/O error */
+		}
+	}
+
+	iobuf->locked = 0;
+	return retval;       
+}
+
+
 ssize_t	rw_raw_dev(int rw, struct file *filp, char *buf, 
 		   size_t size, loff_t *offp)
 {
@@ -254,7 +311,7 @@
 
 	int		sector_size, sector_bits, sector_mask;
 	int		max_sectors;
-	
+	int 		kiobuf_io = 1;
 	/*
 	 * First, a few checks on device size limits 
 	 */
@@ -290,17 +347,17 @@
 	if (err)
 		return err;
 
+	blocknr = *offp >> sector_bits;
 	/*
-	 * Split the IO into KIO_MAX_SECTORS chunks, mapping and
-	 * unmapping the single kiobuf as we go to perform each chunk of
-	 * IO.  
+	 * Try sending down the entire kiobuf first via ll_rw_kio().
+	 * If not successful then, split the IO into KIO_MAX_SECTORS
+	 * chunks, mapping and unmapping the single kiobuf as we go
+	 * to perform each chunk of IO.  
 	 */
-
-	transferred = 0;
-	blocknr = *offp >> sector_bits;
+	err = transferred = 0;
 	while (size > 0) {
 		blocks = size >> sector_bits;
-		if (blocks > max_sectors)
+		if ((blocks > max_sectors) && (kiobuf_io == 0))
 			blocks = max_sectors;
 		if (blocks > limit - blocknr)
 			blocks = limit - blocknr;
@@ -318,11 +375,19 @@
 		if (err) 
 			break;
 #endif
-	
-		for (i=0; i < blocks; i++) 
-			b[i] = blocknr++;
-		
-		err = brw_kiovec(rw, 1, &iobuf, dev, b, sector_size);
+		if (kiobuf_io == 0) {
+			for (i=0; i < blocks; i++) 
+			        b[i] = blocknr++;
+			err = brw_kiovec(rw, 1, &iobuf, dev, b, sector_size);
+		} else {
+			err = try_kiobuf_io(iobuf, rw, blocknr, dev, buf, sector_size);
+			if ( err > 0 ) { 
+				blocknr += (err >> sector_bits);
+			} else if ( err == 0 ) {
+				kiobuf_io = 0;
+				continue;
+			} /* else (err<0) => (err!=iosize); exit loop below */
+		}
 
 		if (err >= 0) {
 			transferred += err;
--- pre9.2-sct/drivers/scsi/scsi_lib.c	Tue May 23 14:24:21 2000
+++ pre9.2-sct+mine/drivers/scsi/scsi_lib.c	Tue May 23 14:42:31 2000
@@ -15,6 +15,8 @@
  * a low-level driver if they wished.   Note however that this file also
  * contains the "default" versions of these functions, as we don't want to
  * go through and retrofit queueing functions into all 30 some-odd drivers.
+ *
+ * Support for kiobuf-based I/O requests. [Chaitanya Tumuluri, chait@sgi.com]
  */
 
 #define __NO_VERSION__
@@ -370,6 +372,161 @@
 	spin_unlock_irqrestore(&io_request_lock, flags);
 }
 
+
+/*
+ * Function:    __scsi_collect_bh_sectors()
+ *
+ * Purpose:     Helper routine for __scsi_end_request() to mark some number
+ *		(or all, if that is the case) of sectors complete.
+ *
+ * Arguments:   req      - request struct. from scsi command block.
+ *              uptodate - 1 if I/O indicates success, 0 for I/O error.
+ *              sectors  - number of sectors we want to mark.
+ *		leftovers- indicates if any sectors were not done.
+ *
+ * Lock status: Assumed that lock is not held upon entry.
+ *
+ * Returns:     Nothing
+ *
+ * Notes:	Separate buffer-head processing from kiobuf processing
+ */
+__inline static void __scsi_collect_bh_sectors(struct request *req,
+					       int uptodate,
+					       int sectors,
+					       char **leftovers)
+{
+	struct buffer_head *bh;
+	
+	do {
+		if ((bh = req->bh) != NULL) {
+			req->bh = bh->b_reqnext;
+			req->nr_sectors -= bh->b_size >> 9;
+			req->sector += bh->b_size >> 9;
+			bh->b_reqnext = NULL;		
+			sectors -= bh->b_size >> 9;
+			bh->b_end_io(bh, uptodate);
+			if ((bh = req->bh) != NULL) {
+				req->current_nr_sectors = bh->b_size >> 9;
+				if (req->nr_sectors < req->current_nr_sectors) {
+					req->nr_sectors = req->current_nr_sectors;
+					printk("collect_bh: buffer-list destroyed\n");
+				}	
+			}	
+		}
+	} while (sectors && bh);
+
+	/* Check for leftovers */
+	if (req->bh) {
+		*leftovers = req->bh->b_data;
+	}	
+	return;	
+
+}	
+
+
+/*
+ * Function:    __scsi_collect_kio_sectors()
+ *
+ * Purpose:     Helper routine for __scsi_end_request() to mark some number
+ *		(or all) of the I/O sectors and attendant pages complete.
+ *		Updates the request nr_segments, nr_sectors accordingly.
+ *
+ * Arguments:   req      - request struct. from scsi command block.
+ *              uptodate - 1 if I/O indicates success, 0 for I/O error.
+ *              sectors  - number of sectors we want to mark.
+ *		leftovers- indicates if any sectors were not done.
+ *
+ * Lock status: Assumed that lock is not held upon entry.
+ *
+ * Returns:     Nothing
+ *
+ * Notes:	Separate buffer-head processing from kiobuf processing.
+ *		We don't know if this was a single or multi-segment sgl
+ *		request. Treat it as though it were a multi-segment one.
+ */
+__inline static void __scsi_collect_kio_sectors(struct request *req,
+					       int uptodate,
+					       int sectors,
+					       char **leftovers)
+{
+	int pgcnt, nr_pages;
+	size_t curr_offset;
+	unsigned long va = 0;
+	unsigned int nr_bytes, total_bytes, page_sectors;
+	
+	nr_pages = req->kiobuf->nr_pages;
+	total_bytes = (req->nr_sectors << 9);
+	curr_offset = req->kiobuf->offset;
+
+	/*
+	 * In the case of leftover requests, the kiobuf->length
+	 * remains the same, but req->nr_sectors would be smaller.
+	 * Adjust curr_offset in this case. If not a leftover,
+	 * the following makes no difference.
+	 */
+	curr_offset += (((req->kiobuf->length >> 9) - req->nr_sectors) << 9);
+
+	/* How far into the kiobuf is the offset? */
+	for (pgcnt=0; pgcnt<nr_pages; pgcnt++) {
+		if(curr_offset >= PAGE_SIZE) {
+			curr_offset -= PAGE_SIZE;
+			continue;
+		} else {
+			break;
+		}
+	}
+	/*		
+	 * Reusing the pgcnt and va value from above:
+	 * Harvest pages to account for number of sectors 
+	 * passed into function. 
+	 */
+	for (nr_bytes = 0;
+	     pgcnt<nr_pages && nr_bytes != total_bytes;
+	     pgcnt++) {
+		va = page_address(req->kiobuf->maplist[pgcnt])
+			+ curr_offset;
+		/* First page or final page? Partial page? */
+		if (curr_offset != 0) {
+		     page_sectors = (PAGE_SIZE - curr_offset) > total_bytes ?
+			  total_bytes >> 9 : (PAGE_SIZE - curr_offset) >> 9;
+		     curr_offset = 0;
+		} else if((nr_bytes + PAGE_SIZE) > total_bytes) {
+			page_sectors = (total_bytes - nr_bytes) >> 9;
+		} else {
+			page_sectors = PAGE_SIZE >> 9;
+		}
+		nr_bytes += (page_sectors << 9);
+		/* Leftover sectors in this page (onward)? */
+		if (sectors < page_sectors) {
+			req->nr_sectors -= sectors;
+			req->sector += sectors;
+			req->current_nr_sectors = page_sectors - sectors;
+			va += (sectors << 9); /* Update for req->buffer */
+			sectors = 0;
+			break;
+		} else {
+			/* Mark this page as done */
+			req->nr_segments--;   /* No clustering for kiobuf */ 
+			req->nr_sectors -= page_sectors;
+			req->sector += page_sectors;
+			if (!uptodate && (req->kiobuf->errno != 0)){
+			     req->kiobuf->errno = -EIO;
+			}
+			sectors -= page_sectors;
+		}
+	}
+	
+	/* Check for leftovers */
+	if (req->nr_sectors) {
+		*leftovers = (char *)va;
+	} else if (req->kiobuf->end_io) {
+	     req->kiobuf->end_io(req->kiobuf);
+	}
+	
+	return;	
+}	
+
+
 /*
  * Function:    scsi_end_request()
  *
@@ -397,7 +554,7 @@
 				     int requeue)
 {
 	struct request *req;
-	struct buffer_head *bh;
+	char * leftovers = NULL;
 
 	ASSERT_LOCK(&io_request_lock, 0);
 
@@ -407,39 +564,29 @@
 		printk(" I/O error: dev %s, sector %lu\n",
 		       kdevname(req->rq_dev), req->sector);
 	}
-	do {
-		if ((bh = req->bh) != NULL) {
-			req->bh = bh->b_reqnext;
-			req->nr_sectors -= bh->b_size >> 9;
-			req->sector += bh->b_size >> 9;
-			bh->b_reqnext = NULL;
-			sectors -= bh->b_size >> 9;
-			bh->b_end_io(bh, uptodate);
-			if ((bh = req->bh) != NULL) {
-				req->current_nr_sectors = bh->b_size >> 9;
-				if (req->nr_sectors < req->current_nr_sectors) {
-					req->nr_sectors = req->current_nr_sectors;
-					printk("scsi_end_request: buffer-list destroyed\n");
-				}
-			}
-		}
-	} while (sectors && bh);
 
+	leftovers = NULL;
+	if (req->bh != NULL) {		/* Buffer head based request */
+		__scsi_collect_bh_sectors(req, uptodate, sectors, &leftovers);
+	} else if (req->kiobuf != NULL) { /* Kiobuf based request */
+		__scsi_collect_kio_sectors(req, uptodate, sectors, &leftovers);
+	} else {
+	     panic("Both bh and kiobuf pointers are unset in request!\n");
+	}
 	/*
 	 * If there are blocks left over at the end, set up the command
 	 * to queue the remainder of them.
 	 */
-	if (req->bh) {
+	if (leftovers != NULL) {
                 request_queue_t *q;
 
-		if( !requeue )
-		{
+		if( !requeue ) {
 			return SCpnt;
 		}
 
                 q = &SCpnt->device->request_queue;
 
-		req->buffer = bh->b_data;
+		req->buffer = leftovers;
 		/*
 		 * Bleah.  Leftovers again.  Stick the leftovers in
 		 * the front of the queue, and goose the queue again.
--- pre9.2-sct/drivers/scsi/scsi_merge.c	Tue May 23 14:24:22 2000
+++ pre9.2-sct+mine/drivers/scsi/scsi_merge.c	Tue May 23 14:23:29 2000
@@ -6,6 +6,7 @@
  *                        Based upon conversations with large numbers
  *                        of people at Linux Expo.
  *	Support for dynamic DMA mapping: Jakub Jelinek (jakub@redhat.com).
+ *      Support for kiobuf-based I/O requests. [Chaitanya Tumuluri, chait@sgi.com]
  */
 
 /*
@@ -90,12 +91,13 @@
 	printk("nr_segments is %x\n", req->nr_segments);
 	printk("counted segments is %x\n", segments);
 	printk("Flags %d %d\n", use_clustering, dma_host);
-	for (bh = req->bh; bh->b_reqnext != NULL; bh = bh->b_reqnext) 
-	{
-		printk("Segment 0x%p, blocks %d, addr 0x%lx\n",
-		       bh,
-		       bh->b_size >> 9,
-		       virt_to_phys(bh->b_data - 1));
+	if (req->bh != NULL) {
+		for (bh = req->bh; bh->b_reqnext != NULL; bh = bh->b_reqnext) {	
+			printk("Segment 0x%p, blocks %d, addr 0x%lx\n",
+			       bh,
+			       bh->b_size >> 9,
+			       virt_to_phys(bh->b_data - 1));
+		}
 	}
 	panic("Ththththaats all folks.  Too dangerous to continue.\n");
 }
@@ -298,9 +300,22 @@
 	SHpnt = SCpnt->host;
 	SDpnt = SCpnt->device;
 
-	req->nr_segments = __count_segments(req, 
-					    CLUSTERABLE_DEVICE(SHpnt, SDpnt),
-					    SHpnt->unchecked_isa_dma, NULL);
+	if (req->kiobuf) {
+		/* Since there is no clustering/merging in kiobuf
+		 * requests, the nr_segments is simply a count of
+		 * the number of pages needing I/O. nr_segments is
+	         * updated in __scsi_collect_kio_sectors() called 
+	         * from scsi_end_request(), for the leftover case.
+	         * [chait@sgi.com]
+		 */
+		return;
+	} else if (req->bh) {
+		req->nr_segments = __count_segments(req, 
+						    CLUSTERABLE_DEVICE(SHpnt, SDpnt),
+						    SHpnt->unchecked_isa_dma, NULL);
+	} else {	
+		panic("Both kiobuf and bh pointers are NULL!");
+	}	
 }
 
 #define MERGEABLE_BUFFERS(X,Y) \
@@ -745,6 +760,191 @@
 MERGEREQFCT(scsi_merge_requests_fn_, 0, 0)
 MERGEREQFCT(scsi_merge_requests_fn_c, 1, 0)
 MERGEREQFCT(scsi_merge_requests_fn_dc, 1, 1)
+
+
+
+/*
+ * Function:    scsi_bh_sgl()
+ *
+ * Purpose:     Helper routine to construct S(catter) G(ather) L(ist)
+ *		assuming buffer_head-based request in the Scsi_Cmnd.
+ *
+ * Arguments:   SCpnt   - Command descriptor 
+ *              use_clustering - 1 if host uses clustering
+ *              dma_host - 1 if this host has ISA DMA issues (bus doesn't
+ *                      expose all of the address lines, so that DMA cannot
+ *                      be done from an arbitrary address).
+ *		sgpnt   - pointer to sgl
+ *
+ * Returns:     Number of sg segments in the sgl.
+ *
+ * Notes:       Only the SCpnt argument should be a non-constant variable.
+ *		This functionality was abstracted out of the original code
+ *		in __init_io().
+ */
+__inline static int scsi_bh_sgl(Scsi_Cmnd * SCpnt,
+			      int use_clustering,
+			      int dma_host,
+			      struct scatterlist * sgpnt)
+{
+	int count;
+	struct buffer_head * bh;
+	struct buffer_head * bhprev;
+	
+	bhprev = NULL;
+
+	for (count = 0, bh = SCpnt->request.bh;
+	     bh; bh = bh->b_reqnext) {
+		if (use_clustering && bhprev != NULL) {
+			if (dma_host &&
+			    virt_to_phys(bhprev->b_data) - 1 == ISA_DMA_THRESHOLD) {
+				/* Nothing - fall through */
+			} else if (CONTIGUOUS_BUFFERS(bhprev, bh)) {
+				/*
+				 * This one is OK.  Let it go.  Note that we
+				 * do not have the ability to allocate
+				 * bounce buffer segments > PAGE_SIZE, so
+				 * for now we limit the thing.
+				 */
+				if( dma_host ) {
+#ifdef DMA_SEGMENT_SIZE_LIMITED
+					if( virt_to_phys(bh->b_data) - 1 < ISA_DMA_THRESHOLD
+					    || sgpnt[count - 1].length + bh->b_size <= PAGE_SIZE ) {
+						sgpnt[count - 1].length += bh->b_size;
+						bhprev = bh;
+						continue;
+					}
+#else
+					sgpnt[count - 1].length += bh->b_size;
+					bhprev = bh;
+					continue;
+#endif
+				} else {
+					sgpnt[count - 1].length += bh->b_size;
+					SCpnt->request_bufflen += bh->b_size;
+					bhprev = bh;
+					continue;
+				}
+			}
+		}
+		count++;
+		sgpnt[count - 1].address = bh->b_data;
+		sgpnt[count - 1].length += bh->b_size;
+		if (!dma_host) {
+			SCpnt->request_bufflen += bh->b_size;
+		}
+		bhprev = bh;
+	}
+
+	return count;
+}
+
+
+/*
+ * Function:    scsi_kio_sgl()
+ *
+ * Purpose:     Helper routine to construct S(catter) G(ather) L(ist)
+ *		assuming kiobuf-based request in the Scsi_Cmnd.
+ *
+ * Arguments:   SCpnt   - Command descriptor 
+ *              dma_host - 1 if this host has ISA DMA issues (bus doesn't
+ *                      expose all of the address lines, so that DMA cannot
+ *                      be done from an arbitrary address).
+ *		sgpnt   - pointer to sgl
+ *
+ * Returns:     Number of sg segments in the sgl.
+ *
+ * Notes:       Only the SCpnt argument should be a non-constant variable.
+ *		This functionality was created out of __ini_io() in the
+ *		original implementation for constructing the sgl for
+ *		kiobuf-based I/Os as well.
+ *
+ *		Constructs SCpnt->use_sg sgl segments for the kiobuf.
+ *
+ *		No clustering of pages is attempted unlike the buffer_head
+ *		case. Primarily because the pages in a kiobuf are unlikely to 
+ *		be contiguous. Bears checking.
+ */
+__inline static int scsi_kio_sgl(Scsi_Cmnd * SCpnt,
+			      int dma_host,
+			      struct scatterlist * sgpnt)
+{
+        int pgcnt, nr_seg, curr_seg, nr_sectors;
+	size_t curr_offset;
+	unsigned long va;
+	unsigned int nr_bytes, total_bytes, sgl_seg_bytes;
+
+	curr_seg = SCpnt->use_sg; /* This many sgl segments */
+	nr_sectors = SCpnt->request.nr_sectors;
+	total_bytes = (nr_sectors << 9);
+	curr_offset = SCpnt->request.kiobuf->offset;
+	
+	/*
+	 * In the case of leftover requests, the kiobuf->length
+	 * remains the same, but req->nr_sectors would be smaller.
+	 * Use this difference to adjust curr_offset in this case. 
+	 * If not a leftover, the following makes no difference.
+	 */
+	curr_offset += (((SCpnt->request.kiobuf->length >> 9) - nr_sectors) << 9);
+	/* How far into the kiobuf is the offset? */
+	for (pgcnt=0; pgcnt<SCpnt->request.kiobuf->nr_pages; pgcnt++) {
+		if(curr_offset >= PAGE_SIZE) {
+			curr_offset -= PAGE_SIZE;
+			continue;
+		} else {
+			break;
+		}
+	}
+	/*		
+	 * Reusing the pgcnt value from above:
+	 * Starting at the right page and offset, build curr_seg
+	 * sgl segments (one per page). Account for both a 
+	 * potentially partial last page and unrequired pages 
+	 * at the end of the kiobuf.
+	 */
+	nr_bytes = 0;
+	for (nr_seg = 0; nr_seg < curr_seg; nr_seg++) {
+		va = page_address(SCpnt->request.kiobuf->maplist[pgcnt])
+			+ curr_offset;
+		++pgcnt;
+		
+		/*
+		 * If this is the first page, account for offset.
+		 * If this the final (maybe partial) page, get remainder.
+		 */
+		if (curr_offset != 0) {
+		     sgl_seg_bytes = PAGE_SIZE - curr_offset;
+		     curr_offset = 0;	
+		} else if((nr_bytes + PAGE_SIZE) > total_bytes) {
+		     sgl_seg_bytes = total_bytes - nr_bytes;
+		} else {	
+		     sgl_seg_bytes = PAGE_SIZE;
+		}
+		
+		nr_bytes += sgl_seg_bytes;
+		sgpnt[nr_seg].address = (char *)va;
+		sgpnt[nr_seg].alt_address = 0;
+		sgpnt[nr_seg].length = sgl_seg_bytes;
+
+		if (!dma_host) {
+		     SCpnt->request_bufflen += sgl_seg_bytes;
+		}
+	}
+	/* Sanity Check */
+	if ((nr_bytes > total_bytes) ||
+	    (pgcnt > SCpnt->request.kiobuf->nr_pages)) {
+		printk(KERN_ERR
+		       "scsi_kio_sgl: sgl bytes[%d], request bytes[%d]\n"
+		       "scsi_kio_sgl: pgcnt[%d], kiobuf->pgcnt[%d]!\n",
+		       nr_bytes, total_bytes, pgcnt, SCpnt->request.kiobuf->nr_pages);
+		BUG();
+	}
+	return nr_seg;
+
+}
+
+
+
 /*
  * Function:    __init_io()
  *
@@ -777,6 +977,9 @@
  *              gather list, the sg count in the request won't be valid
  *              (mainly because we don't need queue management functions
  *              which keep the tally uptodate.
+ *
+ *		Modified to handle kiobuf argument in the SCpnt->request
+ *		structure. 
  */
 __inline static int __init_io(Scsi_Cmnd * SCpnt,
 			      int sg_count_valid,
@@ -784,7 +987,6 @@
 			      int dma_host)
 {
 	struct buffer_head * bh;
-	struct buffer_head * bhprev;
 	char		   * buff;
 	int		     count;
 	int		     i;
@@ -799,11 +1001,11 @@
 	 * needed any more.  Need to play with it and see if we hit the
 	 * panic.  If not, then don't bother.
 	 */
-	if (!SCpnt->request.bh) {
+	if ((!SCpnt->request.bh && !SCpnt->request.kiobuf) ||
+	    (SCpnt->request.bh && SCpnt->request.kiobuf)) {
 		/* 
-		 * Case of page request (i.e. raw device), or unlinked buffer 
-		 * Typically used for swapping, but this isn't how we do
-		 * swapping any more.
+		 * Case of unlinked buffer. Typically used for swapping,
+		 * but this isn't how we do swapping any more.
 		 */
 		panic("I believe this is dead code.  If we hit this, I was wrong");
 #if 0
@@ -819,6 +1021,12 @@
 	req = &SCpnt->request;
 	/*
 	 * First we need to know how many scatter gather segments are needed.
+	 *
+	 * Redundant test per comment below indicating sg_count_valid is always
+	 * set to 1.(ll_rw_blk.c's estimate of req->nr_segments is always trusted).
+	 *
+	 * count is initialized in ll_rw_kio() for the kiobuf path and since these
+	 * requests are never merged, the counts are stay valid.
 	 */
 	if (!sg_count_valid) {
 		count = __count_segments(req, use_clustering, dma_host, NULL);
@@ -842,12 +1050,24 @@
 		this_count = SCpnt->request.nr_sectors;
 		goto single_segment;
 	}
+	/* Check if size of the sgl would be greater than the size
+	 * of the host sgl table. In which case, limit the sgl size.
+	 * When the request sectors are harvested after completion of 
+	 * I/O in __scsi_collect_kio_sectors, the additional sectors 
+	 * will be reinjected into the request queue as a special cmd.
+	 * This will be done till all the request sectors are done.
+	 * [chait@sgi.com]
+	 */
+	if((SCpnt->request.kiobuf != NULL) &&
+	   (count > SCpnt->host->sg_tablesize)) {
+		count = SCpnt->host->sg_tablesize - 1;
+	}
 	SCpnt->use_sg = count;
-
 	/* 
 	 * Allocate the actual scatter-gather table itself.
 	 * scsi_malloc can only allocate in chunks of 512 bytes 
 	 */
+	
 	SCpnt->sglist_len = (SCpnt->use_sg
 			     * sizeof(struct scatterlist) + 511) & ~511;
 
@@ -872,51 +1092,14 @@
 	memset(sgpnt, 0, SCpnt->use_sg * sizeof(struct scatterlist));
 	SCpnt->request_buffer = (char *) sgpnt;
 	SCpnt->request_bufflen = 0;
-	bhprev = NULL;
 
-	for (count = 0, bh = SCpnt->request.bh;
-	     bh; bh = bh->b_reqnext) {
-		if (use_clustering && bhprev != NULL) {
-			if (dma_host &&
-			    virt_to_phys(bhprev->b_data) - 1 == ISA_DMA_THRESHOLD) {
-				/* Nothing - fall through */
-			} else if (CONTIGUOUS_BUFFERS(bhprev, bh)) {
-				/*
-				 * This one is OK.  Let it go.  Note that we
-				 * do not have the ability to allocate
-				 * bounce buffer segments > PAGE_SIZE, so
-				 * for now we limit the thing.
-				 */
-				if( dma_host ) {
-#ifdef DMA_SEGMENT_SIZE_LIMITED
-					if( virt_to_phys(bh->b_data) - 1 < ISA_DMA_THRESHOLD
-					    || sgpnt[count - 1].length + bh->b_size <= PAGE_SIZE ) {
-						sgpnt[count - 1].length += bh->b_size;
-						bhprev = bh;
-						continue;
-					}
-#else
-					sgpnt[count - 1].length += bh->b_size;
-					bhprev = bh;
-					continue;
-#endif
-				} else {
-					sgpnt[count - 1].length += bh->b_size;
-					SCpnt->request_bufflen += bh->b_size;
-					bhprev = bh;
-					continue;
-				}
-			}
-		}
-		count++;
-		sgpnt[count - 1].address = bh->b_data;
-		sgpnt[count - 1].length += bh->b_size;
-		if (!dma_host) {
-			SCpnt->request_bufflen += bh->b_size;
-		}
-		bhprev = bh;
+	if (SCpnt->request.bh){
+		count = scsi_bh_sgl(SCpnt, use_clustering, dma_host, sgpnt);
+	} else if (SCpnt->request.kiobuf) {
+		count = scsi_kio_sgl(SCpnt, dma_host, sgpnt);
+	} else {
+		panic("Yowza! Both kiobuf and buffer_head pointers are null!");
 	}
-
 	/*
 	 * Verify that the count is correct.
 	 */
@@ -1009,6 +1192,17 @@
 	scsi_free(SCpnt->request_buffer, SCpnt->sglist_len);
 
 	/*
+	 * Shouldn't ever get here for a kiobuf request.
+	 *
+	 * Since each segment is a page and also, we couldn't
+	 * allocate bounce buffers for even the first page,
+	 * this means that the DMA buffer pool is exhausted!
+	 */
+	if (SCpnt->request.kiobuf){
+		dma_exhausted(SCpnt, 0);
+	}
+
+	/*
 	 * Make an attempt to pick up as much as we reasonably can.
 	 * Just keep adding sectors until the pool starts running kind of
 	 * low.  The limit of 30 is somewhat arbitrary - the point is that
@@ -1043,7 +1237,6 @@
 	 * segment.  Possibly the entire request, or possibly a small
 	 * chunk of the entire request.
 	 */
-	bh = SCpnt->request.bh;
 	buff = SCpnt->request.buffer;
 
 	if (dma_host) {
@@ -1052,7 +1245,7 @@
 		 * back and allocate a really small one - enough to satisfy
 		 * the first buffer.
 		 */
-		if (virt_to_phys(SCpnt->request.bh->b_data)
+	        if (virt_to_phys(SCpnt->request.buffer)
 		    + (this_count << 9) - 1 > ISA_DMA_THRESHOLD) {
 			buff = (char *) scsi_malloc(this_count << 9);
 			if (!buff) {
@@ -1152,3 +1345,21 @@
 		SDpnt->scsi_init_io_fn = scsi_init_io_vdc;
 	}
 }
+/*
+ * Overrides for Emacs so that we almost follow Linus's tabbing style.
+ * Emacs will notice this stuff at the end of the file and automatically
+ * adjust the settings for this buffer only.  This must remain at the end
+ * of the file.
+ * ---------------------------------------------------------------------------
+ * Local variables:
+ * c-indent-level: 4
+ * c-brace-imaginary-offset: 0
+ * c-brace-offset: -4
+ * c-argdecl-indent: 4
+ * c-label-offset: -4
+ * c-continued-statement-offset: 4
+ * c-continued-brace-offset: 0
+ * indent-tabs-mode: nil
+ * tab-width: 8
+ * End:
+ */
--- pre9.2-sct/drivers/scsi/sd.c	Tue May 23 14:24:21 2000
+++ pre9.2-sct+mine/drivers/scsi/sd.c	Mon May 22 17:53:29 2000
@@ -546,6 +546,7 @@
 static void rw_intr(Scsi_Cmnd * SCpnt)
 {
 	int result = SCpnt->result;
+	
 #if CONFIG_SCSI_LOGGING
 	char nbuff[6];
 #endif
@@ -575,8 +576,14 @@
 			(SCpnt->sense_buffer[4] << 16) |
 			(SCpnt->sense_buffer[5] << 8) |
 			SCpnt->sense_buffer[6];
-			if (SCpnt->request.bh != NULL)
-				block_sectors = SCpnt->request.bh->b_size >> 9;
+
+			/* Tweak to support kiobuf-based I/O requests, [chait@sgi.com] */
+			if (SCpnt->request.kiobuf != NULL)
+			        block_sectors = SCpnt->request.kiobuf->length >> 9;
+			else if (SCpnt->request.bh != NULL)
+			        block_sectors = SCpnt->request.bh->b_size >> 9;
+			else
+			        panic("Both kiobuf and bh pointers are null!\n");
 			switch (SCpnt->device->sector_size) {
 			case 1024:
 				error_sector <<= 1;
--- pre9.2-sct/include/linux/blkdev.h	Tue May 23 14:24:35 2000
+++ pre9.2-sct+mine/include/linux/blkdev.h	Tue May 23 13:48:35 2000
@@ -6,6 +6,7 @@
 #include <linux/genhd.h>
 #include <linux/tqueue.h>
 #include <linux/list.h>
+#include <linux/iobuf.h>
 
 struct request_queue;
 typedef struct request_queue request_queue_t;
@@ -39,6 +40,7 @@
 	void * special;
 	char * buffer;
 	struct semaphore * sem;
+	struct kiobuf * kiobuf;
 	struct buffer_head * bh;
 	struct buffer_head * bhtail;
 	request_queue_t * q;
--- pre9.2-sct/include/linux/elevator.h	Tue May 23 14:24:36 2000
+++ pre9.2-sct+mine/include/linux/elevator.h	Mon May 22 19:05:15 2000
@@ -107,7 +107,12 @@
 	elevator->sequence++;
 	if (req->cmd == READ)
 		elevator->read_pendings++;
-	elevator->nr_segments++;
+
+	if (req->kiobuf != NULL) {
+	     elevator->nr_segments += req->nr_segments;
+	} else {
+	     elevator->nr_segments++;
+	}
 }
 
 static inline int elevator_request_latency(elevator_t * elevator, int rw)
--- pre9.2-sct/include/linux/fs.h	Tue May 23 14:24:34 2000
+++ pre9.2-sct+mine/include/linux/fs.h	Mon May 22 17:56:47 2000
@@ -1063,6 +1063,7 @@
 extern struct buffer_head * get_hash_table(kdev_t, int, int);
 extern struct buffer_head * getblk(kdev_t, int, int);
 extern void ll_rw_block(int, int, struct buffer_head * bh[]);
+extern void ll_rw_kio(int , struct kiobuf *, kdev_t, unsigned long, size_t, int *);
 extern int is_read_only(kdev_t);
 extern void __brelse(struct buffer_head *);
 static inline void brelse(struct buffer_head *buf)
--- pre9.2-sct/include/linux/iobuf.h	Tue May 23 14:25:30 2000
+++ pre9.2-sct+mine/include/linux/iobuf.h	Mon May 22 18:01:30 2000
@@ -56,6 +56,7 @@
 	atomic_t	io_count;	/* IOs still in progress */
 	int		errno;		/* Status of completed IO */
 	void		(*end_io) (struct kiobuf *); /* Completion callback */
+	void *k_dev_id;			/* Store kiovec (or pagebuf) here */
 	wait_queue_head_t wait_queue;
 };
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2000-05-23 21:58 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <00c201bfc0d7$56664db0$4d0310ac@fairfax.datafocus.com>
     [not found] ` <200005181955.MAA71492@getafix.engr.sgi.com>
2000-05-19 15:09   ` PATCH: Enhance queueing/scsi-midlayer to handle kiobufs. [Re: Request splits] Stephen C. Tweedie
2000-05-19 15:48     ` Brian Pomerantz
2000-05-19 15:55       ` Stephen C. Tweedie
2000-05-19 16:17         ` Brian Pomerantz
2000-05-19 18:00           ` Chaitanya Tumuluri
2000-05-19 18:11           ` Gérard Roudier
2000-05-19 19:24             ` Brian Pomerantz
2000-05-19 20:43               ` Gérard Roudier
2000-05-20  9:10                 ` Change direct I/O memory model? [Was Re: PATCH: Enhance queueing/scsi-midlayer to handle kiobufs] Mark Mokryn
2000-05-19 17:53         ` PATCH: Enhance queueing/scsi-midlayer to handle kiobufs. [Re: Request splits] Chaitanya Tumuluri
2000-05-19 17:38     ` Chaitanya Tumuluri
2000-05-23 21:58     ` Chaitanya Tumuluri

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox