Message-Id: <200005232158.OAA77313@getafix.engr.sgi.com>
Subject: Re: PATCH: Enhance queueing/scsi-midlayer to handle kiobufs. [Re: Request splits]
In-reply-to: Your message of "Fri, 19 May 2000 16:09:58 BST." <20000519160958.C9961@redhat.com>
Date: Tue, 23 May 2000 14:58:34 -0700
From: Chaitanya Tumuluri
To: "Stephen C. Tweedie"
Cc: chait@sgi.com, Eric Youngdale, Alan Cox, Douglas Gilbert,
    Brian Pomerantz, linux-scsi@vger.rutgers.edu, linux-mm@kvack.org

On Fri, 19 May 2000 16:09:58 BST, "Stephen C. Tweedie" wrote:
>Hi,
>
>On Thu, May 18, 2000 at 12:55:04PM -0700, Chaitanya Tumuluri wrote:
>
> < stuff deleted >
>
>> So, I enhanced Stephen Tweedie's
>> raw I/O and the queueing/scsi layers to handle kiobufs-based requests. This is
>> in addition to the current buffer_head based request processing.
>
>The "current" kiobuf code is in ftp.uk.linux.org:/pub/linux/sct/fs/raw-io/.
>It includes a number of bug fixes (mainly rationalising the error returns),
>plus a few new significant bits of functionality. If you can get me a
>patch against those diffs, I'll include your new code in the main kiobuf
>patchset. (I'm still maintaining the different kiobuf patches as
>separate patches within that patchset tarball.)
>

Stephen and others,

Here's my patch against the 2.3.99.pre9-2 patchset from your site. The main
differences from my earlier post are:

	- removed the #ifdefs around my code, as Stephen Tweedie suggested,
	- corrected the indentation problems pointed out earlier (Eric/Alan).

Finally, I'd like to repeat that, given the consensus about moving away from
buffer-head based I/O in the future, it makes sense for me to retain the
little bit of code duplication. This is in the interests of easy surgery when
we do remove the buffer-head I/O paths.

While I see a decent (up to 10%) improvement in b/w and turnaround time for
I/O to a single disk, the biggest impact is the (almost 40%) reduction in CPU
utilization with the new codepath. These numbers are from simple `lmdd' tests
timed with /usr/bin/time.

Based on further feedback from this audience, I would like to propose this
change to Linus at some point as a general SCSI mechanism to handle
kiobuf-based requests.

Thanks much,
-Chait.

----------------------------CUT HERE---------------------------------------

--- pre9.2-sct/drivers/block/ll_rw_blk.c	Tue May 23 14:24:22 2000
+++ pre9.2-sct+mine/drivers/block/ll_rw_blk.c	Tue May 23 14:38:20 2000
@@ -4,6 +4,7 @@
  * Copyright (C) 1991, 1992 Linus Torvalds
  * Copyright (C) 1994, Karl Keyte: Added support for disk statistics
  * Elevator latency, (C) 2000 Andrea Arcangeli SuSE
+ * Support for kiobuf-based I/O requests: Chaitanya Tumuluri [chait@sgi.com]
  */
 
 /*
@@ -639,7 +640,8 @@
 			starving = 1;
 		if (latency < 0)
 			continue;
-
+		if (req->kiobuf)
+			continue;
 		if (req->sem)
 			continue;
 		if (req->cmd != rw)
@@ -744,6 +746,7 @@
 	req->nr_hw_segments = 1;	/* Always 1 for a new request. */
 	req->buffer = bh->b_data;
 	req->sem = NULL;
+	req->kiobuf = NULL;
 	req->bh = bh;
 	req->bhtail = bh;
 	req->q = q;
@@ -886,6 +889,311 @@
 	__ll_rw_block(rw, nr, bh, 1);
 }
 
+/*
+ * Function:    __make_kio_request()
+ *
+ * Purpose:     Construct a kiobuf-based request and insert into request queue.
+ *
+ * Arguments:   q       - request queue of device
+ *              rw      - read/write
+ *              kiobuf  - collection of pages
+ *              dev     - device against which I/O requested
+ *              blocknr - dev block number at which to start I/O
+ *              blksize - units (512B or other) of blocknr
+ *
+ * Lock status: No lock held upon entry.
+ * + * Returns: Nothing + * + * Notes: Requests generated by this function should _NOT_ be merged by + * the __make_request() (new check for `req->kiobuf') + * + * All (relevant) req->Y parameters are expressed in sector size + * of 512B for kiobuf based I/O. This is assumed in the scsi + * mid-layer as well. + */ +static inline void __make_kio_request(request_queue_t * q, + int rw, + struct kiobuf * kiobuf, + kdev_t dev, + unsigned long blocknr, + size_t blksize) +{ + int major = MAJOR(dev); + unsigned int sector, count, nr_bytes, total_bytes, nr_seg; + struct request * req; + int rw_ahead, max_req; + unsigned long flags; + struct list_head * head = &q->queue_head; + size_t curr_offset; + int orig_latency; + elevator_t * elevator; + int correct_size, i, kioind; + + /* + * Sanity Tests: + * + * The input arg. `blocknr' is in units of the + * input arg. `blksize' (inode->i_sb->s_blocksize). + * Convert to 512B unit used in blk_size[] array. + */ + count = kiobuf->length >> 9; + sector = blocknr * (blksize >> 9); + + if (blk_size[major]) { + unsigned long maxsector = (blk_size[major][MINOR(dev)] << 1) + 1; + + if (maxsector < count || maxsector - count < sector) { + if (!blk_size[major][MINOR(dev)]) { + kiobuf->errno = -EINVAL; + goto end_io; + } + /* This may well happen - the kernel calls bread() + without checking the size of the device, e.g., + when mounting a device. */ + printk(KERN_INFO + "attempt to access beyond end of device\n"); + printk(KERN_INFO "%s: rw=%d, want=%d, limit=%d\n", + kdevname(dev), rw, + (sector + count)>>1, + blk_size[major][MINOR(dev)]); + kiobuf->errno = -ESPIPE; + goto end_io; + } + } + /* + * Allow only basic block size multiples in the + * kiobuf->length. + */ + correct_size = BLOCK_SIZE; + if (blksize_size[major]) { + i = blksize_size[major][MINOR(dev)]; + if (i) + correct_size = i; + } + if ((kiobuf->length % correct_size) != 0) { + printk(KERN_NOTICE "ll_rw_kio: " + "request size [%d] not a multiple of device [%s] block-size [%d]\n", + kiobuf->length, + kdevname(dev), + correct_size); + kiobuf->errno = -EINVAL; + goto end_io; + } + rw_ahead = 0; /* normal case; gets changed below for READA */ + switch (rw) { + case READA: + rw_ahead = 1; + rw = READ; /* drop into READ */ + case READ: + kstat.pgpgin++; + max_req = NR_REQUEST; /* reads take precedence */ + break; + case WRITERAW: + rw = WRITE; + goto do_write; /* Skip the buffer refile */ + case WRITE: + do_write: + /* + * We don't allow the write-requests to fill up the + * queue completely: we want some room for reads, + * as they take precedence. The last third of the + * requests are only for reads. + */ + kstat.pgpgout++; + max_req = (NR_REQUEST * 2) / 3; + break; + default: + BUG(); + kiobuf->errno = -EINVAL; + goto end_io; + } + + /* + * Creation of bounce buffers for data in high memory + * should (is) be handled lower in the food-chain. + * Ccurrently done in scsi_merge.c for scsi disks. + * + * Look for a free request with spinlock held. + * Apart from atomic queue access, it prevents + * another thread that has already queued a kiobuf-request + * into this queue from starting it, till we are done. + */ + elevator = &q->elevator; + orig_latency = elevator_request_latency(elevator, rw); + spin_lock_irqsave(&io_request_lock,flags); + + if (list_empty(head)) + q->plug_device_fn(q, dev); + /* + * The scsi disk and cdrom drivers completely remove the request + * from the queue when they start processing an entry. 
For this
+ *              reason it is safe to continue to add links to the top entry
+ *              for those devices.
+ *
+ *              All other drivers need to jump over the first entry, as that
+ *              entry may be busy being processed and we thus can't change
+ *              it.
+ */
+	if (q->head_active && !q->plugged)
+		head = head->next;
+
+	/* find an unused request. */
+	req = get_request(max_req, dev);
+
+	/*
+	 * if no request available: if rw_ahead, forget it,
+	 * otherwise try again blocking..
+	 */
+	if (!req) {
+		spin_unlock_irqrestore(&io_request_lock,flags);
+		if (rw_ahead){
+			kiobuf->errno = -EBUSY;
+			goto end_io;
+		}
+		req = __get_request_wait(max_req, dev);
+		spin_lock_irqsave(&io_request_lock,flags);
+
+		/* revalidate elevator */
+		head = &q->queue_head;
+		if (q->head_active && !q->plugged)
+			head = head->next;
+	}
+
+	/* fill up the request-info, and add it to the queue */
+	req->cmd = rw;
+	req->errors = 0;
+	req->sector = sector;
+	req->nr_hw_segments = 1;	/* Always 1 for a new request. */
+	req->nr_sectors = count;	/* Length of kiobuf */
+	req->sem = NULL;
+	req->kiobuf = kiobuf;
+	req->bh = NULL;
+	req->bhtail = NULL;
+	req->q = q;
+	/* Calculate req->buffer */
+	curr_offset = kiobuf->offset;
+	for (kioind = 0; kioind < kiobuf->nr_pages; kioind++)
+		if (curr_offset >= PAGE_SIZE)
+			curr_offset -= PAGE_SIZE;
+		else
+			break;
+	req->buffer = (char *) page_address(kiobuf->maplist[kioind]) +
+			curr_offset;
+
+	/* Calculate current_nr_sectors and # of scatter gather segments needed */
+	total_bytes = kiobuf->length;
+	nr_bytes = (PAGE_SIZE - curr_offset) > total_bytes ?
+			total_bytes : (PAGE_SIZE - curr_offset);
+	req->current_nr_sectors = nr_bytes >> 9;
+
+	for (nr_seg = 1;
+	     kioind < kiobuf->nr_pages && nr_bytes != total_bytes;
+	     kioind++) {
+		++nr_seg;
+		if ((nr_bytes + PAGE_SIZE) > total_bytes) {
+			break;
+		} else {
+			nr_bytes += PAGE_SIZE;
+		}
+	}
+	req->nr_segments = nr_seg;
+
+	add_request(q, req, head, orig_latency);
+	elevator_account_request(elevator, req);
+
+	spin_unlock_irqrestore(&io_request_lock, flags);
+
+end_io:
+	return;
+}
+
+
+
+/*
+ * Function:    ll_rw_kio()
+ *
+ * Purpose:     Insert kiobuf-based request into request queue.
+ *
+ * Arguments:   rw      - read/write
+ *              kiobuf  - collection of pages
+ *              dev     - device against which I/O requested
+ *              blocknr - dev block number at which to start I/O
+ *              sector  - units (512B or other) of blocknr
+ *              error   - return status
+ *
+ * Lock status: Assumed no lock held upon entry.
+ *              Assumed that the pages in the kiobuf ___ARE LOCKED DOWN___.
+ *
+ * Returns:     Nothing
+ *
+ * Notes:       This function is called from any subsystem using kiovec[]
+ *              collection of kiobufs for I/O (e.g. `pagebufs', raw-io).
+ *              Relies on "kiobuf" field in the request structure.
+ */
+void ll_rw_kio(int rw,
+	       struct kiobuf *kiobuf,
+	       kdev_t dev,
+	       unsigned long blocknr,
+	       size_t sector,
+	       int *error)
+{
+	request_queue_t *q;
+	/*
+	 * Only support SCSI disk for now.
+	 *
+	 * ENOSYS to indicate caller
+	 * should try ll_rw_block()
+	 * for non-SCSI (e.g. IDE) disks
+	 * and for MD requests.
+ */ + if (!SCSI_DISK_MAJOR(MAJOR(dev)) || + (MAJOR(dev) == MD_MAJOR)) { + *error = -ENOSYS; + goto end_io; + } + /* + * Sanity checks + */ + q = blk_get_queue(dev); + if (!q) { + printk(KERN_ERR + "ll_rw_kio: Nnonexistent block-device %s\n", + kdevname(dev)); + *error = -ENODEV; + goto end_io; + } + if ((rw & WRITE) && is_read_only(dev)) { + printk(KERN_NOTICE "Can't write to read-only device %s\n", + kdevname(dev)); + *error = -EPERM; + goto end_io; + } + if (q->make_request_fn) { + printk(KERN_ERR + "ll_rw_kio: Unexpected device [%s] queueing function encountered\n", + kdevname(dev)); + *error = -ENOSYS; + goto end_io; + } + + __make_kio_request(q, rw, kiobuf, dev, blocknr, sector); + if (kiobuf->errno != 0) { + *error = kiobuf->errno; + goto end_io; + } + + return; +end_io: + /* + * We come here only on an error so, just set + * kiobuf->errno and call the completion fn. + */ + if(kiobuf->errno == 0) + kiobuf->errno = *error; +} + + #ifdef CONFIG_STRAM_SWAP extern int stram_device_init (void); #endif @@ -1079,3 +1387,5 @@ EXPORT_SYMBOL(blk_queue_pluggable); EXPORT_SYMBOL(blk_queue_make_request); EXPORT_SYMBOL(generic_make_request); +EXPORT_SYMBOL(__make_kio_request); +EXPORT_SYMBOL(ll_rw_kio); --- pre9.2-sct/drivers/char/raw.c Tue May 23 14:25:36 2000 +++ pre9.2-sct+mine/drivers/char/raw.c Mon May 22 19:00:09 2000 @@ -238,6 +238,63 @@ #define SECTOR_SIZE (1U << SECTOR_BITS) #define SECTOR_MASK (SECTOR_SIZE - 1) +/* + * IO completion routine for a kiobuf-based request. + */ +static void end_kiobuf_io_kiobuf(struct kiobuf *kiobuf) +{ + kiobuf->locked = 0; + if (atomic_dec_and_test(&kiobuf->io_count)) + wake_up(&kiobuf->wait_queue); +} + +/* + * Send I/O down the ll_rw_kio() path first. + * It is assumed that any requisite locking + * and unlocking of pages in the kiobuf has + * been taken care of by the caller. + * + * Return 0 if I/O should be retried on buffer_head path. + * Return number of transferred bytes if successful. + * Return -1 value, if there was an I/O error. + */ +static inline int try_kiobuf_io(struct kiobuf *iobuf, + int rw, + unsigned long blocknr, + kdev_t dev, + char *buf, + size_t sector_size) +{ + int err, retval; + + iobuf->end_io = end_kiobuf_io_kiobuf; + iobuf->errno = 0; + iobuf->locked = 1; + atomic_inc(&iobuf->io_count); + err = 0; + ll_rw_kio(rw, iobuf, dev, blocknr, sector_size, &err); + + if ( err == 0 ) { + kiobuf_wait_for_io(iobuf); + if (iobuf->errno == 0) { + retval = iobuf->length; /* Success */ + } else { + retval = -1; /* I/O error */ + } + } else { + atomic_dec(&iobuf->io_count); + if ( err == -ENOSYS ) { + retval = 0; /* Retry the buffer_head path */ + } else { + retval = -1; /* I/O error */ + } + } + + iobuf->locked = 0; + return retval; +} + + ssize_t rw_raw_dev(int rw, struct file *filp, char *buf, size_t size, loff_t *offp) { @@ -254,7 +311,7 @@ int sector_size, sector_bits, sector_mask; int max_sectors; - + int kiobuf_io = 1; /* * First, a few checks on device size limits */ @@ -290,17 +347,17 @@ if (err) return err; + blocknr = *offp >> sector_bits; /* - * Split the IO into KIO_MAX_SECTORS chunks, mapping and - * unmapping the single kiobuf as we go to perform each chunk of - * IO. + * Try sending down the entire kiobuf first via ll_rw_kio(). + * If not successful then, split the IO into KIO_MAX_SECTORS + * chunks, mapping and unmapping the single kiobuf as we go + * to perform each chunk of IO. 
*/ - - transferred = 0; - blocknr = *offp >> sector_bits; + err = transferred = 0; while (size > 0) { blocks = size >> sector_bits; - if (blocks > max_sectors) + if ((blocks > max_sectors) && (kiobuf_io == 0)) blocks = max_sectors; if (blocks > limit - blocknr) blocks = limit - blocknr; @@ -318,11 +375,19 @@ if (err) break; #endif - - for (i=0; i < blocks; i++) - b[i] = blocknr++; - - err = brw_kiovec(rw, 1, &iobuf, dev, b, sector_size); + if (kiobuf_io == 0) { + for (i=0; i < blocks; i++) + b[i] = blocknr++; + err = brw_kiovec(rw, 1, &iobuf, dev, b, sector_size); + } else { + err = try_kiobuf_io(iobuf, rw, blocknr, dev, buf, sector_size); + if ( err > 0 ) { + blocknr += (err >> sector_bits); + } else if ( err == 0 ) { + kiobuf_io = 0; + continue; + } /* else (err<0) => (err!=iosize); exit loop below */ + } if (err >= 0) { transferred += err; --- pre9.2-sct/drivers/scsi/scsi_lib.c Tue May 23 14:24:21 2000 +++ pre9.2-sct+mine/drivers/scsi/scsi_lib.c Tue May 23 14:42:31 2000 @@ -15,6 +15,8 @@ * a low-level driver if they wished. Note however that this file also * contains the "default" versions of these functions, as we don't want to * go through and retrofit queueing functions into all 30 some-odd drivers. + * + * Support for kiobuf-based I/O requests. [Chaitanya Tumuluri, chait@sgi.com] */ #define __NO_VERSION__ @@ -370,6 +372,161 @@ spin_unlock_irqrestore(&io_request_lock, flags); } + +/* + * Function: __scsi_collect_bh_sectors() + * + * Purpose: Helper routine for __scsi_end_request() to mark some number + * (or all, if that is the case) of sectors complete. + * + * Arguments: req - request struct. from scsi command block. + * uptodate - 1 if I/O indicates success, 0 for I/O error. + * sectors - number of sectors we want to mark. + * leftovers- indicates if any sectors were not done. + * + * Lock status: Assumed that lock is not held upon entry. + * + * Returns: Nothing + * + * Notes: Separate buffer-head processing from kiobuf processing + */ +__inline static void __scsi_collect_bh_sectors(struct request *req, + int uptodate, + int sectors, + char **leftovers) +{ + struct buffer_head *bh; + + do { + if ((bh = req->bh) != NULL) { + req->bh = bh->b_reqnext; + req->nr_sectors -= bh->b_size >> 9; + req->sector += bh->b_size >> 9; + bh->b_reqnext = NULL; + sectors -= bh->b_size >> 9; + bh->b_end_io(bh, uptodate); + if ((bh = req->bh) != NULL) { + req->current_nr_sectors = bh->b_size >> 9; + if (req->nr_sectors < req->current_nr_sectors) { + req->nr_sectors = req->current_nr_sectors; + printk("collect_bh: buffer-list destroyed\n"); + } + } + } + } while (sectors && bh); + + /* Check for leftovers */ + if (req->bh) { + *leftovers = req->bh->b_data; + } + return; + +} + + +/* + * Function: __scsi_collect_kio_sectors() + * + * Purpose: Helper routine for __scsi_end_request() to mark some number + * (or all) of the I/O sectors and attendant pages complete. + * Updates the request nr_segments, nr_sectors accordingly. + * + * Arguments: req - request struct. from scsi command block. + * uptodate - 1 if I/O indicates success, 0 for I/O error. + * sectors - number of sectors we want to mark. + * leftovers- indicates if any sectors were not done. + * + * Lock status: Assumed that lock is not held upon entry. + * + * Returns: Nothing + * + * Notes: Separate buffer-head processing from kiobuf processing. + * We don't know if this was a single or multi-segment sgl + * request. Treat it as though it were a multi-segment one. 
+ */
+__inline static void __scsi_collect_kio_sectors(struct request *req,
+						int uptodate,
+						int sectors,
+						char **leftovers)
+{
+	int pgcnt, nr_pages;
+	size_t curr_offset;
+	unsigned long va = 0;
+	unsigned int nr_bytes, total_bytes, page_sectors;
+
+	nr_pages = req->kiobuf->nr_pages;
+	total_bytes = (req->nr_sectors << 9);
+	curr_offset = req->kiobuf->offset;
+
+	/*
+	 * In the case of leftover requests, the kiobuf->length
+	 * remains the same, but req->nr_sectors would be smaller.
+	 * Adjust curr_offset in this case. If not a leftover,
+	 * the following makes no difference.
+	 */
+	curr_offset += (((req->kiobuf->length >> 9) - req->nr_sectors) << 9);
+
+	/* How far into the kiobuf is the offset? */
+	for (pgcnt = 0; pgcnt < nr_pages; pgcnt++) {
+		if (curr_offset >= PAGE_SIZE) {
+			curr_offset -= PAGE_SIZE;
+			continue;
+		} else {
+			break;
+		}
+	}
+	/*
+	 * Reusing the pgcnt and va value from above:
+	 * Harvest pages to account for number of sectors
+	 * passed into function.
+	 */
+	for (nr_bytes = 0;
+	     pgcnt < nr_pages && sectors != 0;
+	     pgcnt++) {
+		va = page_address(req->kiobuf->maplist[pgcnt]) +
+			curr_offset;
+		/* First page or final page? Partial page? */
+		if (curr_offset != 0) {
+			page_sectors = (PAGE_SIZE - curr_offset) > total_bytes ?
+				total_bytes >> 9 : (PAGE_SIZE - curr_offset) >> 9;
+			curr_offset = 0;
+		} else if ((nr_bytes + PAGE_SIZE) > total_bytes) {
+			page_sectors = (total_bytes - nr_bytes) >> 9;
+		} else {
+			page_sectors = PAGE_SIZE >> 9;
+		}
+		nr_bytes += (page_sectors << 9);
+		/* Leftover sectors in this page (onward)? */
+		if (sectors < page_sectors) {
+			req->nr_sectors -= sectors;
+			req->sector += sectors;
+			req->current_nr_sectors = page_sectors - sectors;
+			va += (sectors << 9);	/* Update for req->buffer */
+			sectors = 0;
+			break;
+		} else {
+			/* Mark this page as done */
+			req->nr_segments--;	/* No clustering for kiobuf */
+			req->nr_sectors -= page_sectors;
+			req->sector += page_sectors;
+			if (!uptodate && (req->kiobuf->errno != 0)) {
+				req->kiobuf->errno = -EIO;
+			}
+			sectors -= page_sectors;
+		}
+	}
+
+	/* Check for leftovers */
+	if (req->nr_sectors) {
+		*leftovers = (char *)va;
+	} else if (req->kiobuf->end_io) {
+		req->kiobuf->end_io(req->kiobuf);
+	}
+
+	return;
+}
+
+
+
 /*
  * Function: scsi_end_request()
  *
@@ -397,7 +554,7 @@
 			   int requeue)
 {
 	struct request *req;
-	struct buffer_head *bh;
+	char * leftovers = NULL;
 
 	ASSERT_LOCK(&io_request_lock, 0);
 
@@ -407,39 +564,29 @@
 		printk(" I/O error: dev %s, sector %lu\n",
 		       kdevname(req->rq_dev), req->sector);
 	}
-	do {
-		if ((bh = req->bh) != NULL) {
-			req->bh = bh->b_reqnext;
-			req->nr_sectors -= bh->b_size >> 9;
-			req->sector += bh->b_size >> 9;
-			bh->b_reqnext = NULL;
-			sectors -= bh->b_size >> 9;
-			bh->b_end_io(bh, uptodate);
-			if ((bh = req->bh) != NULL) {
-				req->current_nr_sectors = bh->b_size >> 9;
-				if (req->nr_sectors < req->current_nr_sectors) {
-					req->nr_sectors = req->current_nr_sectors;
-					printk("scsi_end_request: buffer-list destroyed\n");
-				}
-			}
-		}
-	} while (sectors && bh);
+	leftovers = NULL;
+	if (req->bh != NULL) {			/* Buffer head based request */
+		__scsi_collect_bh_sectors(req, uptodate, sectors, &leftovers);
+	} else if (req->kiobuf != NULL) {	/* Kiobuf based request */
+		__scsi_collect_kio_sectors(req, uptodate, sectors, &leftovers);
+	} else {
+		panic("Both bh and kiobuf pointers are unset in request!\n");
+	}
 
 	/*
 	 * If there are blocks left over at the end, set up the command
 	 * to queue the remainder of them.
*/ - if (req->bh) { + if (leftovers != NULL) { request_queue_t *q; - if( !requeue ) - { + if( !requeue ) { return SCpnt; } q = &SCpnt->device->request_queue; - req->buffer = bh->b_data; + req->buffer = leftovers; /* * Bleah. Leftovers again. Stick the leftovers in * the front of the queue, and goose the queue again. --- pre9.2-sct/drivers/scsi/scsi_merge.c Tue May 23 14:24:22 2000 +++ pre9.2-sct+mine/drivers/scsi/scsi_merge.c Tue May 23 14:23:29 2000 @@ -6,6 +6,7 @@ * Based upon conversations with large numbers * of people at Linux Expo. * Support for dynamic DMA mapping: Jakub Jelinek (jakub@redhat.com). + * Support for kiobuf-based I/O requests. [Chaitanya Tumuluri, chait@sgi.com] */ /* @@ -90,12 +91,13 @@ printk("nr_segments is %x\n", req->nr_segments); printk("counted segments is %x\n", segments); printk("Flags %d %d\n", use_clustering, dma_host); - for (bh = req->bh; bh->b_reqnext != NULL; bh = bh->b_reqnext) - { - printk("Segment 0x%p, blocks %d, addr 0x%lx\n", - bh, - bh->b_size >> 9, - virt_to_phys(bh->b_data - 1)); + if (req->bh != NULL) { + for (bh = req->bh; bh->b_reqnext != NULL; bh = bh->b_reqnext) { + printk("Segment 0x%p, blocks %d, addr 0x%lx\n", + bh, + bh->b_size >> 9, + virt_to_phys(bh->b_data - 1)); + } } panic("Ththththaats all folks. Too dangerous to continue.\n"); } @@ -298,9 +300,22 @@ SHpnt = SCpnt->host; SDpnt = SCpnt->device; - req->nr_segments = __count_segments(req, - CLUSTERABLE_DEVICE(SHpnt, SDpnt), - SHpnt->unchecked_isa_dma, NULL); + if (req->kiobuf) { + /* Since there is no clustering/merging in kiobuf + * requests, the nr_segments is simply a count of + * the number of pages needing I/O. nr_segments is + * updated in __scsi_collect_kio_sectors() called + * from scsi_end_request(), for the leftover case. + * [chait@sgi.com] + */ + return; + } else if (req->bh) { + req->nr_segments = __count_segments(req, + CLUSTERABLE_DEVICE(SHpnt, SDpnt), + SHpnt->unchecked_isa_dma, NULL); + } else { + panic("Both kiobuf and bh pointers are NULL!"); + } } #define MERGEABLE_BUFFERS(X,Y) \ @@ -745,6 +760,191 @@ MERGEREQFCT(scsi_merge_requests_fn_, 0, 0) MERGEREQFCT(scsi_merge_requests_fn_c, 1, 0) MERGEREQFCT(scsi_merge_requests_fn_dc, 1, 1) + + + +/* + * Function: scsi_bh_sgl() + * + * Purpose: Helper routine to construct S(catter) G(ather) L(ist) + * assuming buffer_head-based request in the Scsi_Cmnd. + * + * Arguments: SCpnt - Command descriptor + * use_clustering - 1 if host uses clustering + * dma_host - 1 if this host has ISA DMA issues (bus doesn't + * expose all of the address lines, so that DMA cannot + * be done from an arbitrary address). + * sgpnt - pointer to sgl + * + * Returns: Number of sg segments in the sgl. + * + * Notes: Only the SCpnt argument should be a non-constant variable. + * This functionality was abstracted out of the original code + * in __init_io(). + */ +__inline static int scsi_bh_sgl(Scsi_Cmnd * SCpnt, + int use_clustering, + int dma_host, + struct scatterlist * sgpnt) +{ + int count; + struct buffer_head * bh; + struct buffer_head * bhprev; + + bhprev = NULL; + + for (count = 0, bh = SCpnt->request.bh; + bh; bh = bh->b_reqnext) { + if (use_clustering && bhprev != NULL) { + if (dma_host && + virt_to_phys(bhprev->b_data) - 1 == ISA_DMA_THRESHOLD) { + /* Nothing - fall through */ + } else if (CONTIGUOUS_BUFFERS(bhprev, bh)) { + /* + * This one is OK. Let it go. Note that we + * do not have the ability to allocate + * bounce buffer segments > PAGE_SIZE, so + * for now we limit the thing. 
+				 */
+				if( dma_host ) {
+#ifdef DMA_SEGMENT_SIZE_LIMITED
+					if( virt_to_phys(bh->b_data) - 1 < ISA_DMA_THRESHOLD
+					    || sgpnt[count - 1].length + bh->b_size <= PAGE_SIZE ) {
+						sgpnt[count - 1].length += bh->b_size;
+						bhprev = bh;
+						continue;
+					}
+#else
+					sgpnt[count - 1].length += bh->b_size;
+					bhprev = bh;
+					continue;
+#endif
+				} else {
+					sgpnt[count - 1].length += bh->b_size;
+					SCpnt->request_bufflen += bh->b_size;
+					bhprev = bh;
+					continue;
+				}
+			}
+		}
+		count++;
+		sgpnt[count - 1].address = bh->b_data;
+		sgpnt[count - 1].length += bh->b_size;
+		if (!dma_host) {
+			SCpnt->request_bufflen += bh->b_size;
+		}
+		bhprev = bh;
+	}
+
+	return count;
+}
+
+
+/*
+ * Function:    scsi_kio_sgl()
+ *
+ * Purpose:     Helper routine to construct S(catter) G(ather) L(ist)
+ *              assuming kiobuf-based request in the Scsi_Cmnd.
+ *
+ * Arguments:   SCpnt    - Command descriptor
+ *              dma_host - 1 if this host has ISA DMA issues (bus doesn't
+ *                         expose all of the address lines, so that DMA cannot
+ *                         be done from an arbitrary address).
+ *              sgpnt    - pointer to sgl
+ *
+ * Returns:     Number of sg segments in the sgl.
+ *
+ * Notes:       Only the SCpnt argument should be a non-constant variable.
+ *              This functionality was created out of __init_io() in the
+ *              original implementation for constructing the sgl for
+ *              kiobuf-based I/Os as well.
+ *
+ *              Constructs SCpnt->use_sg sgl segments for the kiobuf.
+ *
+ *              No clustering of pages is attempted, unlike the buffer_head
+ *              case, primarily because the pages in a kiobuf are unlikely to
+ *              be contiguous. Bears checking.
+ */
+__inline static int scsi_kio_sgl(Scsi_Cmnd * SCpnt,
+				 int dma_host,
+				 struct scatterlist * sgpnt)
+{
+	int pgcnt, nr_seg, curr_seg, nr_sectors;
+	size_t curr_offset;
+	unsigned long va;
+	unsigned int nr_bytes, total_bytes, sgl_seg_bytes;
+
+	curr_seg = SCpnt->use_sg;	/* This many sgl segments */
+	nr_sectors = SCpnt->request.nr_sectors;
+	total_bytes = (nr_sectors << 9);
+	curr_offset = SCpnt->request.kiobuf->offset;
+
+	/*
+	 * In the case of leftover requests, the kiobuf->length
+	 * remains the same, but req->nr_sectors would be smaller.
+	 * Use this difference to adjust curr_offset in this case.
+	 * If not a leftover, the following makes no difference.
+	 */
+	curr_offset += (((SCpnt->request.kiobuf->length >> 9) - nr_sectors) << 9);
+	/* How far into the kiobuf is the offset? */
+	for (pgcnt = 0; pgcnt < SCpnt->request.kiobuf->nr_pages; pgcnt++) {
+		if (curr_offset >= PAGE_SIZE) {
+			curr_offset -= PAGE_SIZE;
+			continue;
+		} else {
+			break;
+		}
+	}
+	/*
+	 * Reusing the pgcnt value from above:
+	 * Starting at the right page and offset, build curr_seg
+	 * sgl segments (one per page). Account for both a
+	 * potentially partial last page and unrequired pages
+	 * at the end of the kiobuf.
+	 */
+	nr_bytes = 0;
+	for (nr_seg = 0; nr_seg < curr_seg; nr_seg++) {
+		va = page_address(SCpnt->request.kiobuf->maplist[pgcnt]) +
+			curr_offset;
+		++pgcnt;
+
+		/*
+		 * If this is the first page, account for offset.
+		 * If this is the final (maybe partial) page, get remainder.
+ */ + if (curr_offset != 0) { + sgl_seg_bytes = PAGE_SIZE - curr_offset; + curr_offset = 0; + } else if((nr_bytes + PAGE_SIZE) > total_bytes) { + sgl_seg_bytes = total_bytes - nr_bytes; + } else { + sgl_seg_bytes = PAGE_SIZE; + } + + nr_bytes += sgl_seg_bytes; + sgpnt[nr_seg].address = (char *)va; + sgpnt[nr_seg].alt_address = 0; + sgpnt[nr_seg].length = sgl_seg_bytes; + + if (!dma_host) { + SCpnt->request_bufflen += sgl_seg_bytes; + } + } + /* Sanity Check */ + if ((nr_bytes > total_bytes) || + (pgcnt > SCpnt->request.kiobuf->nr_pages)) { + printk(KERN_ERR + "scsi_kio_sgl: sgl bytes[%d], request bytes[%d]\n" + "scsi_kio_sgl: pgcnt[%d], kiobuf->pgcnt[%d]!\n", + nr_bytes, total_bytes, pgcnt, SCpnt->request.kiobuf->nr_pages); + BUG(); + } + return nr_seg; + +} + + + /* * Function: __init_io() * @@ -777,6 +977,9 @@ * gather list, the sg count in the request won't be valid * (mainly because we don't need queue management functions * which keep the tally uptodate. + * + * Modified to handle kiobuf argument in the SCpnt->request + * structure. */ __inline static int __init_io(Scsi_Cmnd * SCpnt, int sg_count_valid, @@ -784,7 +987,6 @@ int dma_host) { struct buffer_head * bh; - struct buffer_head * bhprev; char * buff; int count; int i; @@ -799,11 +1001,11 @@ * needed any more. Need to play with it and see if we hit the * panic. If not, then don't bother. */ - if (!SCpnt->request.bh) { + if ((!SCpnt->request.bh && !SCpnt->request.kiobuf) || + (SCpnt->request.bh && SCpnt->request.kiobuf)) { /* - * Case of page request (i.e. raw device), or unlinked buffer - * Typically used for swapping, but this isn't how we do - * swapping any more. + * Case of unlinked buffer. Typically used for swapping, + * but this isn't how we do swapping any more. */ panic("I believe this is dead code. If we hit this, I was wrong"); #if 0 @@ -819,6 +1021,12 @@ req = &SCpnt->request; /* * First we need to know how many scatter gather segments are needed. + * + * Redundant test per comment below indicating sg_count_valid is always + * set to 1.(ll_rw_blk.c's estimate of req->nr_segments is always trusted). + * + * count is initialized in ll_rw_kio() for the kiobuf path and since these + * requests are never merged, the counts are stay valid. */ if (!sg_count_valid) { count = __count_segments(req, use_clustering, dma_host, NULL); @@ -842,12 +1050,24 @@ this_count = SCpnt->request.nr_sectors; goto single_segment; } + /* Check if size of the sgl would be greater than the size + * of the host sgl table. In which case, limit the sgl size. + * When the request sectors are harvested after completion of + * I/O in __scsi_collect_kio_sectors, the additional sectors + * will be reinjected into the request queue as a special cmd. + * This will be done till all the request sectors are done. + * [chait@sgi.com] + */ + if((SCpnt->request.kiobuf != NULL) && + (count > SCpnt->host->sg_tablesize)) { + count = SCpnt->host->sg_tablesize - 1; + } SCpnt->use_sg = count; - /* * Allocate the actual scatter-gather table itself. 
* scsi_malloc can only allocate in chunks of 512 bytes */ + SCpnt->sglist_len = (SCpnt->use_sg * sizeof(struct scatterlist) + 511) & ~511; @@ -872,51 +1092,14 @@ memset(sgpnt, 0, SCpnt->use_sg * sizeof(struct scatterlist)); SCpnt->request_buffer = (char *) sgpnt; SCpnt->request_bufflen = 0; - bhprev = NULL; - for (count = 0, bh = SCpnt->request.bh; - bh; bh = bh->b_reqnext) { - if (use_clustering && bhprev != NULL) { - if (dma_host && - virt_to_phys(bhprev->b_data) - 1 == ISA_DMA_THRESHOLD) { - /* Nothing - fall through */ - } else if (CONTIGUOUS_BUFFERS(bhprev, bh)) { - /* - * This one is OK. Let it go. Note that we - * do not have the ability to allocate - * bounce buffer segments > PAGE_SIZE, so - * for now we limit the thing. - */ - if( dma_host ) { -#ifdef DMA_SEGMENT_SIZE_LIMITED - if( virt_to_phys(bh->b_data) - 1 < ISA_DMA_THRESHOLD - || sgpnt[count - 1].length + bh->b_size <= PAGE_SIZE ) { - sgpnt[count - 1].length += bh->b_size; - bhprev = bh; - continue; - } -#else - sgpnt[count - 1].length += bh->b_size; - bhprev = bh; - continue; -#endif - } else { - sgpnt[count - 1].length += bh->b_size; - SCpnt->request_bufflen += bh->b_size; - bhprev = bh; - continue; - } - } - } - count++; - sgpnt[count - 1].address = bh->b_data; - sgpnt[count - 1].length += bh->b_size; - if (!dma_host) { - SCpnt->request_bufflen += bh->b_size; - } - bhprev = bh; + if (SCpnt->request.bh){ + count = scsi_bh_sgl(SCpnt, use_clustering, dma_host, sgpnt); + } else if (SCpnt->request.kiobuf) { + count = scsi_kio_sgl(SCpnt, dma_host, sgpnt); + } else { + panic("Yowza! Both kiobuf and buffer_head pointers are null!"); } - /* * Verify that the count is correct. */ @@ -1009,6 +1192,17 @@ scsi_free(SCpnt->request_buffer, SCpnt->sglist_len); /* + * Shouldn't ever get here for a kiobuf request. + * + * Since each segment is a page and also, we couldn't + * allocate bounce buffers for even the first page, + * this means that the DMA buffer pool is exhausted! + */ + if (SCpnt->request.kiobuf){ + dma_exhausted(SCpnt, 0); + } + + /* * Make an attempt to pick up as much as we reasonably can. * Just keep adding sectors until the pool starts running kind of * low. The limit of 30 is somewhat arbitrary - the point is that @@ -1043,7 +1237,6 @@ * segment. Possibly the entire request, or possibly a small * chunk of the entire request. */ - bh = SCpnt->request.bh; buff = SCpnt->request.buffer; if (dma_host) { @@ -1052,7 +1245,7 @@ * back and allocate a really small one - enough to satisfy * the first buffer. */ - if (virt_to_phys(SCpnt->request.bh->b_data) + if (virt_to_phys(SCpnt->request.buffer) + (this_count << 9) - 1 > ISA_DMA_THRESHOLD) { buff = (char *) scsi_malloc(this_count << 9); if (!buff) { @@ -1152,3 +1345,21 @@ SDpnt->scsi_init_io_fn = scsi_init_io_vdc; } } +/* + * Overrides for Emacs so that we almost follow Linus's tabbing style. + * Emacs will notice this stuff at the end of the file and automatically + * adjust the settings for this buffer only. This must remain at the end + * of the file. 
+ * --------------------------------------------------------------------------- + * Local variables: + * c-indent-level: 4 + * c-brace-imaginary-offset: 0 + * c-brace-offset: -4 + * c-argdecl-indent: 4 + * c-label-offset: -4 + * c-continued-statement-offset: 4 + * c-continued-brace-offset: 0 + * indent-tabs-mode: nil + * tab-width: 8 + * End: + */ --- pre9.2-sct/drivers/scsi/sd.c Tue May 23 14:24:21 2000 +++ pre9.2-sct+mine/drivers/scsi/sd.c Mon May 22 17:53:29 2000 @@ -546,6 +546,7 @@ static void rw_intr(Scsi_Cmnd * SCpnt) { int result = SCpnt->result; + #if CONFIG_SCSI_LOGGING char nbuff[6]; #endif @@ -575,8 +576,14 @@ (SCpnt->sense_buffer[4] << 16) | (SCpnt->sense_buffer[5] << 8) | SCpnt->sense_buffer[6]; - if (SCpnt->request.bh != NULL) - block_sectors = SCpnt->request.bh->b_size >> 9; + + /* Tweak to support kiobuf-based I/O requests, [chait@sgi.com] */ + if (SCpnt->request.kiobuf != NULL) + block_sectors = SCpnt->request.kiobuf->length >> 9; + else if (SCpnt->request.bh != NULL) + block_sectors = SCpnt->request.bh->b_size >> 9; + else + panic("Both kiobuf and bh pointers are null!\n"); switch (SCpnt->device->sector_size) { case 1024: error_sector <<= 1; --- pre9.2-sct/include/linux/blkdev.h Tue May 23 14:24:35 2000 +++ pre9.2-sct+mine/include/linux/blkdev.h Tue May 23 13:48:35 2000 @@ -6,6 +6,7 @@ #include #include #include +#include struct request_queue; typedef struct request_queue request_queue_t; @@ -39,6 +40,7 @@ void * special; char * buffer; struct semaphore * sem; + struct kiobuf * kiobuf; struct buffer_head * bh; struct buffer_head * bhtail; request_queue_t * q; --- pre9.2-sct/include/linux/elevator.h Tue May 23 14:24:36 2000 +++ pre9.2-sct+mine/include/linux/elevator.h Mon May 22 19:05:15 2000 @@ -107,7 +107,12 @@ elevator->sequence++; if (req->cmd == READ) elevator->read_pendings++; - elevator->nr_segments++; + + if (req->kiobuf != NULL) { + elevator->nr_segments += req->nr_segments; + } else { + elevator->nr_segments++; + } } static inline int elevator_request_latency(elevator_t * elevator, int rw) --- pre9.2-sct/include/linux/fs.h Tue May 23 14:24:34 2000 +++ pre9.2-sct+mine/include/linux/fs.h Mon May 22 17:56:47 2000 @@ -1063,6 +1063,7 @@ extern struct buffer_head * get_hash_table(kdev_t, int, int); extern struct buffer_head * getblk(kdev_t, int, int); extern void ll_rw_block(int, int, struct buffer_head * bh[]); +extern void ll_rw_kio(int , struct kiobuf *, kdev_t, unsigned long, size_t, int *); extern int is_read_only(kdev_t); extern void __brelse(struct buffer_head *); static inline void brelse(struct buffer_head *buf) --- pre9.2-sct/include/linux/iobuf.h Tue May 23 14:25:30 2000 +++ pre9.2-sct+mine/include/linux/iobuf.h Mon May 22 18:01:30 2000 @@ -56,6 +56,7 @@ atomic_t io_count; /* IOs still in progress */ int errno; /* Status of completed IO */ void (*end_io) (struct kiobuf *); /* Completion callback */ + void *k_dev_id; /* Store kiovec (or pagebuf) here */ wait_queue_head_t wait_queue; }; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/
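P.S. For anyone who wants to drive the new interface from another subsystem
(pagebuf, etc.), here is a minimal, illustrative sketch of the calling
pattern. It is not part of the patch; it simply restates what try_kiobuf_io()
and end_kiobuf_io_kiobuf() in the raw.c hunk above do, using the ll_rw_kio()
prototype and the kiobuf fields this patch adds. The helper names
submit_kiobuf() and kio_end_io() are made up for the example.

/* Illustrative only -- mirrors try_kiobuf_io()/end_kiobuf_io_kiobuf() above. */
static void kio_end_io(struct kiobuf *iobuf)
{
	/* Completion callback: wake up the submitter once all I/O is done. */
	if (atomic_dec_and_test(&iobuf->io_count))
		wake_up(&iobuf->wait_queue);
}

/*
 * Returns bytes transferred on success, 0 if the caller should fall back
 * to the buffer_head path (brw_kiovec), or a negative errno on failure.
 */
static int submit_kiobuf(int rw, struct kiobuf *iobuf, kdev_t dev,
			 unsigned long blocknr, size_t sector_size)
{
	int err = 0;

	iobuf->end_io = kio_end_io;
	iobuf->errno = 0;
	atomic_inc(&iobuf->io_count);

	/* Hand the whole (locked-down) kiobuf to the block/SCSI layers. */
	ll_rw_kio(rw, iobuf, dev, blocknr, sector_size, &err);

	if (err) {
		atomic_dec(&iobuf->io_count);
		/* -ENOSYS => non-SCSI or MD device: retry via the bh path. */
		return (err == -ENOSYS) ? 0 : err;
	}

	/* Success: wait for completion, as rw_raw_dev() does. */
	kiobuf_wait_for_io(iobuf);
	return iobuf->errno ? iobuf->errno : iobuf->length;
}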