linux-mm.kvack.org archive mirror
* [RFC PATCH] discarding swap
@ 2008-09-09 21:28 Hugh Dickins
  2008-09-10 17:35 ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Hugh Dickins @ 2008-09-09 21:28 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Jens Axboe, linux-fsdevel, linux-mm

Hi David,

I notice 2.6.27-rc5-mm1 and next trees include "discard" support,
so far being used only by fat.  Is that really intended for 2.6.28?

Here's a proposed patch to use discard on swap.  But I know nothing of
the tradeoffs, nor what really goes on in such a device, and have none
to try, so this patch may be wide of the mark.

I started off with the discard_swap() in sys_swapon(): it seemed obvious
to discard all the old swap contents (except the swap header) at that
point (though it would be unfortunate if someone mistakenly does a fresh
boot of a kernel when they mean to resume from swap disk: discarding
would remove the option to back out before swap gets written).

But I've no idea how long that is liable to take on a device which really
supports discard: discarding an entire partition all at once sounds good
for giving the device the greatest freedom to reorganize itself,
but would that happen quickly enough?

To do that (and to avoid duplicating the same loop within swapfile.c),
I'd like to change the nr_sects argument to blkdev_issue_discard() from
unsigned to sector_t - unsigned long would be large enough for swap, but
sector_t makes more general sense.  And if that change is made, it's
probably right to change sb_issue_discard()'s nr_sects to match?

I also got worried that blkdev_issue_discard() might get stuck in its
loop for a very long time, requests getting freed from the queue before
it quite fills up: so inserted an occasional cond_resched().

It seems odd to me that the data-less blkdev_issue_discard() is limited
at all by max_hw_sectors; but I'm guessing there's a good reason, safety
perhaps, which has forced you to that.

Where else should swap be discarded?  I don't want to add anything into
the swap freeing path, and the locking there would be problematic.  And
I take it that doing a discard of each single page just before rewriting
it would just be a totally pointless waste of time.

But the swap allocation algorithm in scan_swap_map() does already try
to locate a free cluster of swap pages (256 of them, not tunable today
but could easily be: 1MB on x86).  So I think it makes sense to discard
the old swap when we find a free cluster (and forget about discarding
when we don't find one).  That's what I've implemented - but then,
does it still make any sense also to discard all of swap at swapon?

I think that discarding when allocating swap cannot safely use GFP_KERNEL
for its bio_alloc()s: swap_writepage() uses GFP_NOIO, so should be good.
That involves an added gfp_t arg to blkdev_issue_discard() - unless I
copy it to make a swap_issue_discard(), which seems a bad idea.

Here's the proposed patch, or combination of patches: the blkdev and
swap parts should certainly be separated.  Advice welcome - thanks!

Not-yet-Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

 block/blk-barrier.c    |   11 +-
 include/linux/blkdev.h |    9 +-
 include/linux/swap.h   |   15 ++-
 mm/swapfile.c          |  168 +++++++++++++++++++++++++++++++++++++--
 4 files changed, 183 insertions(+), 20 deletions(-)

--- 2.6.27-rc5-mm1/block/blk-barrier.c	2008-09-05 10:05:33.000000000 +0100
+++ linux/block/blk-barrier.c	2008-09-09 20:00:26.000000000 +0100
@@ -332,15 +332,17 @@ static void blkdev_discard_end_io(struct
  * @bdev:	blockdev to issue discard for
  * @sector:	start sector
  * @nr_sects:	number of sectors to discard
+ * @gfp_mask:	memory allocation flags (for bio_alloc)
  *
  * Description:
  *    Issue a discard request for the sectors in question. Does not wait.
  */
-int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
-			 unsigned nr_sects)
+int blkdev_issue_discard(struct block_device *bdev,
+			 sector_t sector, sector_t nr_sects, gfp_t gfp_mask)
 {
 	struct request_queue *q;
 	struct bio *bio;
+	unsigned int iters = 0;
 	int ret = 0;
 
 	if (bdev->bd_disk == NULL)
@@ -354,7 +356,10 @@ int blkdev_issue_discard(struct block_de
 		return -EOPNOTSUPP;
 
 	while (nr_sects && !ret) {
-		bio = bio_alloc(GFP_KERNEL, 0);
+		if ((++iters & 7) == 0)
+			cond_resched();
+
+		bio = bio_alloc(gfp_mask, 0);
 		if (!bio)
 			return -ENOMEM;
 
--- 2.6.27-rc5-mm1/include/linux/blkdev.h	2008-09-05 10:05:51.000000000 +0100
+++ linux/include/linux/blkdev.h	2008-09-09 20:00:26.000000000 +0100
@@ -16,6 +16,7 @@
 #include <linux/bio.h>
 #include <linux/module.h>
 #include <linux/stringify.h>
+#include <linux/gfp.h>
 #include <linux/bsg.h>
 #include <linux/smp.h>
 
@@ -875,15 +876,15 @@ static inline struct request *blk_map_qu
 }
 
 extern int blkdev_issue_flush(struct block_device *, sector_t *);
-extern int blkdev_issue_discard(struct block_device *, sector_t sector,
-				unsigned nr_sects);
+extern int blkdev_issue_discard(struct block_device *,
+				sector_t sector, sector_t nr_sects, gfp_t);
 
 static inline int sb_issue_discard(struct super_block *sb,
-				   sector_t block, unsigned nr_blocks)
+				   sector_t block, sector_t nr_blocks)
 {
 	block <<= (sb->s_blocksize_bits - 9);
 	nr_blocks <<= (sb->s_blocksize_bits - 9);
-	return blkdev_issue_discard(sb->s_bdev, block, nr_blocks);
+	return blkdev_issue_discard(sb->s_bdev, block, nr_blocks, GFP_KERNEL);
 }
 
 /*
--- 2.6.27-rc5-mm1/include/linux/swap.h	2008-09-05 10:05:52.000000000 +0100
+++ linux/include/linux/swap.h	2008-09-09 20:00:26.000000000 +0100
@@ -119,8 +119,9 @@ struct swap_extent {
 
 enum {
 	SWP_USED	= (1 << 0),	/* is slot in swap_info[] used? */
-	SWP_WRITEOK	= (1 << 1),	/* ok to write to this swap?	*/
-	SWP_ACTIVE	= (SWP_USED | SWP_WRITEOK),
+	SWP_WRITEOK	= (1 << 1),	/* ok to write to this swap? */
+	SWP_DISCARDABLE = (1 << 2),	/* blkdev supports discard */
+	SWP_DISCARDING	= (1 << 3),	/* now discarding a free cluster */
 					/* add others here before... */
 	SWP_SCANNING	= (1 << 8),	/* refcount in scan_swap_map */
 };
@@ -134,22 +135,24 @@ enum {
  * The in-memory structure used to track swap areas.
  */
 struct swap_info_struct {
-	unsigned int flags;
+	unsigned long flags;
 	int prio;			/* swap priority */
+	int next;			/* next entry on swap list */
 	struct file *swap_file;
 	struct block_device *bdev;
 	struct list_head extent_list;
 	struct swap_extent *curr_swap_extent;
-	unsigned old_block_size;
-	unsigned short * swap_map;
+	unsigned short *swap_map;
 	unsigned int lowest_bit;
 	unsigned int highest_bit;
+	unsigned int lowest_alloc;
+	unsigned int highest_alloc;
 	unsigned int cluster_next;
 	unsigned int cluster_nr;
 	unsigned int pages;
 	unsigned int max;
 	unsigned int inuse_pages;
-	int next;			/* next entry on swap list */
+	unsigned int old_block_size;
 };
 
 struct swap_list_t {
--- 2.6.27-rc5-mm1/mm/swapfile.c	2008-09-05 10:05:54.000000000 +0100
+++ linux/mm/swapfile.c	2008-09-09 20:00:26.000000000 +0100
@@ -83,13 +83,93 @@ void swap_unplug_io_fn(struct backing_de
 	up_read(&swap_unplug_sem);
 }
 
+/*
+ * swapon tells the device that all the old swap contents can be discarded,
+ * to allow the swap device to optimize its wear-levelling.
+ */
+static int discard_swap(struct swap_info_struct *si)
+{
+	struct swap_extent *se;
+	int err = 0;
+
+	list_for_each_entry(se, &si->extent_list, list) {
+		sector_t start_block = se->start_block << (PAGE_SHIFT - 9);
+		pgoff_t nr_blocks = se->nr_pages << (PAGE_SHIFT - 9);
+
+		if (se->start_page == 0) {
+			/* Do not discard the swap header page! */
+			start_block += 1 << (PAGE_SHIFT - 9);
+			nr_blocks -= 1 << (PAGE_SHIFT - 9);
+			if (!nr_blocks)
+				continue;
+		}
+
+		err = blkdev_issue_discard(si->bdev, start_block,
+						nr_blocks, GFP_KERNEL);
+		if (err)
+			break;
+
+		cond_resched();
+	}
+	return err;		/* That will often be -EOPNOTSUPP */
+}
+
+/*
+ * swap allocation tells the device that a cluster of swap can now be discarded,
+ * to allow the swap device to optimize its wear-levelling.
+ */
+static void discard_swap_cluster(struct swap_info_struct *si,
+				 pgoff_t start_page, pgoff_t nr_pages)
+{
+	struct swap_extent *se = si->curr_swap_extent;
+	int found_extent = 0;
+
+	while (nr_pages) {
+		struct list_head *lh;
+
+		if (se->start_page <= start_page &&
+		    start_page < se->start_page + se->nr_pages) {
+			pgoff_t offset = start_page - se->start_page;
+			sector_t start_block = se->start_block + offset;
+			pgoff_t nr_blocks = se->nr_pages - offset;
+
+			if (nr_blocks > nr_pages)
+				nr_blocks = nr_pages;
+			start_page += nr_blocks;
+			nr_pages -= nr_blocks;
+
+			if (!found_extent++)
+				si->curr_swap_extent = se;
+
+			start_block <<= PAGE_SHIFT - 9;
+			nr_blocks <<= PAGE_SHIFT - 9;
+			if (blkdev_issue_discard(si->bdev, start_block,
+							nr_blocks, GFP_NOIO))
+				break;
+		}
+
+		lh = se->list.next;
+		if (lh == &si->extent_list)
+			lh = lh->next;
+		se = list_entry(lh, struct swap_extent, list);
+	}
+}
+
+static int wait_for_discard(void *word)
+{
+	schedule();
+	return 0;
+}
+
 #define SWAPFILE_CLUSTER	256
 #define LATENCY_LIMIT		256
 
 static inline unsigned long scan_swap_map(struct swap_info_struct *si)
 {
-	unsigned long offset, last_in_cluster;
+	unsigned long offset;
+	unsigned long last_in_cluster = 0;
 	int latency_ration = LATENCY_LIMIT;
+	int found_free_cluster = 0;
 
 	/* 
 	 * We try to cluster swap pages by allocating them sequentially
@@ -102,10 +182,24 @@ static inline unsigned long scan_swap_ma
 	 */
 
 	si->flags += SWP_SCANNING;
-	if (unlikely(!si->cluster_nr)) {
-		si->cluster_nr = SWAPFILE_CLUSTER - 1;
-		if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER)
+	if (unlikely(!si->cluster_nr--)) {
+		if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) {
+			si->cluster_nr = SWAPFILE_CLUSTER - 1;
 			goto lowest;
+		}
+		if (si->flags & SWP_DISCARDABLE) {
+			/*
+			 * Start range check on racing allocations, in case
+			 * they overlap the cluster we eventually decide on
+			 * (we scan without swap_lock to allow preemption).
+			 * It's hardly conceivable that cluster_nr could be
+			 * wrapped during our scan, but don't depend on it.
+			 */
+			if (si->lowest_alloc)
+				goto lowest;
+			si->lowest_alloc = si->max;
+			si->highest_alloc = 0;
+		}
 		spin_unlock(&swap_lock);
 
 		offset = si->lowest_bit;
@@ -118,6 +212,8 @@ static inline unsigned long scan_swap_ma
 			else if (offset == last_in_cluster) {
 				spin_lock(&swap_lock);
 				si->cluster_next = offset-SWAPFILE_CLUSTER+1;
+				si->cluster_nr = SWAPFILE_CLUSTER - 1;
+				found_free_cluster = 1;
 				goto cluster;
 			}
 			if (unlikely(--latency_ration < 0)) {
@@ -125,11 +221,13 @@ static inline unsigned long scan_swap_ma
 				latency_ration = LATENCY_LIMIT;
 			}
 		}
+
 		spin_lock(&swap_lock);
+		si->cluster_nr = SWAPFILE_CLUSTER - 1;
+		si->lowest_alloc = 0;
 		goto lowest;
 	}
 
-	si->cluster_nr--;
 cluster:
 	offset = si->cluster_next;
 	if (offset > si->highest_bit)
@@ -151,6 +249,60 @@ checks:	if (!(si->flags & SWP_WRITEOK))
 		si->swap_map[offset] = 1;
 		si->cluster_next = offset + 1;
 		si->flags -= SWP_SCANNING;
+
+		if (si->lowest_alloc) {
+			/*
+			 * Only set when SWP_DISCARDABLE, and there's a scan
+			 * for a free cluster in progress or just completed.
+			 */
+			if (found_free_cluster) {
+				/*
+				 * To optimize wear-levelling, discard the
+				 * old data of the cluster, taking care not to
+				 * discard any of its pages that have already
+				 * been allocated by racing tasks (offset has
+				 * already stepped over any at the beginning).
+				 */
+				if (offset < si->highest_alloc &&
+				    si->lowest_alloc <= last_in_cluster)
+					last_in_cluster = si->lowest_alloc - 1;
+				si->flags |= SWP_DISCARDING;
+				spin_unlock(&swap_lock);
+
+				if (offset < last_in_cluster)
+					discard_swap_cluster(si, offset,
+						last_in_cluster - offset + 1);
+
+				spin_lock(&swap_lock);
+				si->lowest_alloc = 0;
+				si->flags &= ~SWP_DISCARDING;
+
+				smp_mb();	/* wake_up_bit advises this */
+				wake_up_bit(&si->flags, ilog2(SWP_DISCARDING));
+
+			} else if (si->flags & SWP_DISCARDING) {
+				/*
+				 * Delay using pages allocated by racing tasks
+				 * until the whole discard has been issued. We
+				 * could defer that delay until swap_writepage,
+				 * but it's easier to keep this self-contained.
+				 */
+				spin_unlock(&swap_lock);
+				wait_on_bit(&si->flags, ilog2(SWP_DISCARDING),
+					wait_for_discard, TASK_UNINTERRUPTIBLE);
+				spin_lock(&swap_lock);
+			} else {
+				/*
+				 * Note pages allocated by racing tasks while
+				 * scan for a free cluster is in progress, so
+				 * that its final discard can exclude them.
+				 */
+				if (offset < si->lowest_alloc)
+					si->lowest_alloc = offset;
+				if (offset > si->highest_alloc)
+					si->highest_alloc = offset;
+			}
+		}
 		return offset;
 	}
 
@@ -1253,7 +1405,7 @@ asmlinkage long sys_swapoff(const char _
 	spin_lock(&swap_lock);
 	for (type = swap_list.head; type >= 0; type = swap_info[type].next) {
 		p = swap_info + type;
-		if ((p->flags & SWP_ACTIVE) == SWP_ACTIVE) {
+		if (p->flags & SWP_WRITEOK) {
 			if (p->swap_file->f_mapping == mapping)
 				break;
 		}
@@ -1687,6 +1839,8 @@ asmlinkage long sys_swapon(const char __
 		error = -EINVAL;
 		goto bad_swap;
 	}
+	if (discard_swap(p) == 0)
+		p->flags |= SWP_DISCARDABLE;
 
 	mutex_lock(&swapon_mutex);
 	spin_lock(&swap_lock);
@@ -1696,7 +1850,7 @@ asmlinkage long sys_swapon(const char __
 	else
 		p->prio = --least_priority;
 	p->swap_map = swap_map;
-	p->flags = SWP_ACTIVE;
+	p->flags |= SWP_WRITEOK;
 	nr_swap_pages += nr_good_pages;
 	total_swap_pages += nr_good_pages;
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org


* Re: [RFC PATCH] discarding swap
  2008-09-09 21:28 [RFC PATCH] discarding swap Hugh Dickins
@ 2008-09-10 17:35 ` Jens Axboe
  2008-09-10 19:51   ` Hugh Dickins
  0 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2008-09-10 17:35 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: David Woodhouse, linux-fsdevel, linux-mm

On Tue, Sep 09 2008, Hugh Dickins wrote:
> Hi David,
> 
> I notice 2.6.27-rc5-mm1 and next trees include "discard" support,
> so far being used only by fat.  Is that really intended for 2.6.28?
> 
> Here's a proposed patch to use discard on swap.  But I know nothing of
> the tradeoffs, nor what really goes on in such a device, and have none
> to try, so this patch may be wide of the mark.
> 
> I started off with the discard_swap() in sys_swapon(): it seemed obvious
> to discard all the old swap contents (except the swap header) at that
> point (though it would be unfortunate if someone mistakenly does a fresh
> boot of a kernel when they mean to resume from swap disk: discarding
> would remove the option to back out before swap gets written).

I like the idea, it's certainly within the expected use of this.

> But I've no idea how long that is liable to take on a device which really
> supports discard: discarding an entire partition all at once sounds good
> for giving the device the greatest freedom to reorganize itself,
> but would that happen quickly enough?

That, I think, is something we will still have to wait and see about! I
don't know what the timings on this are, presumably the device firmware
could go about its business in the background and let the command
complete quickly. There's a lot of background stuff in the first place,
so I'd be surprised if it wasn't quick. If it's meant for fs usage, it
needs to be quick as well.

> To do that (and to avoid duplicating the same loop within swapfile.c),
> I'd like to change the nr_sects argument to blkdev_issue_discard() from
> unsigned to sector_t - unsigned long would be large enough for swap, but
> sector_t makes more general sense.  And if that change is made, it's
> probably right to change sb_issue_discard()'s nr_sects to match?

I guess that is needed, if we are going to pass the full device sizes in
there. So fine with me.

> I also got worried that blkdev_issue_discard() might get stuck in its
> loop for a very long time, requests getting freed from the queue before
> it quite fills up: so inserted an occasional cond_resched().

Needs experimentation, but yes I agree with the notion of sprinkling
some cond_resched()'s in there.

> It seems odd to me that the data-less blkdev_issue_discard() is limited
> at all by max_hw_sectors; but I'm guessing there's a good reason, safety
> perhaps, which has forced you to that.

The discard request needs to be turned into a hw command at some point,
and for that we still need to fit the offset and size in there. So we
are still limited by 32MB commands on sata w/lba48, even though we are
not moving any data. Suboptimal, but...

> Where else should swap be discarded?  I don't want to add anything into
> the swap freeing path, and the locking there would be problematic.  And
> I take it that doing a discard of each single page just before rewriting
> it would just be a totally pointless waste of time.
> 
> But the swap allocation algorithm in scan_swap_map() does already try
> to locate a free cluster of swap pages (256 of them, not tunable today
> but could easily be: 1MB on x86).  So I think it makes sense to discard
> the old swap when we find a free cluster (and forget about discarding
> when we don't find one).  That's what I've implemented - but then,
> does it still make any sense also to discard all of swap at swapon?

I think so - it's akin to mkfs doing a full discard before making the
file system, yet the file system should also notify the device when it
frees a block. It's a balance of course, don't do it if you are
immediately using the block again.

> I think that discarding when allocating swap cannot safely use GFP_KERNEL
> for its bio_alloc()s: swap_writepage() uses GFP_NOIO, so should be good.
> That involves an added gfp_t arg to blkdev_issue_discard() - unless I
> copy it to make a swap_issue_discard(), which seems a bad idea.

Agree, you need GFP_NOIO there. So the patch to pass in the gfp_t is
just fine; in fact, any caller other than the ioctl path needs that.

> Here's the proposed patch, or combination of patches: the blkdev and
> swap parts should certainly be separated.  Advice welcome - thanks!

I'll snatch up the blk bits and put them in for-2.6.28. OK if I add your
SOB to that?


-- 
Jens Axboe



* Re: [RFC PATCH] discarding swap
  2008-09-10 17:35 ` Jens Axboe
@ 2008-09-10 19:51   ` Hugh Dickins
  2008-09-10 21:28     ` David Woodhouse
  2008-09-11  8:58     ` Jens Axboe
  0 siblings, 2 replies; 13+ messages in thread
From: Hugh Dickins @ 2008-09-10 19:51 UTC (permalink / raw)
  To: Jens Axboe; +Cc: David Woodhouse, linux-fsdevel, linux-mm

On Wed, 10 Sep 2008, Jens Axboe wrote:
> On Tue, Sep 09 2008, Hugh Dickins wrote:
> 
> > It seems odd to me that the data-less blkdev_issue_discard() is limited
> > at all by max_hw_sectors; but I'm guessing there's a good reason, safety
> > perhaps, which has forced you to that.
> 
> The discard request needs to be turned into a hw command at some point,
> and for that we still need to fit the offset and size in there. So we
> are still limited by 32MB commands on sata w/lba48, even though we are
> not moving any data. Suboptimal, but...

... makes good sense, thanks.

> > Here's the proposed patch, or combination of patches: the blkdev and
> > swap parts should certainly be separated.  Advice welcome - thanks!
> 
> I'll snatch up the blk bits and put them in for-2.6.28. OK if I add your
> SOB to that?

That would be great.  Thanks a lot for all your comments, I'd been
expecting a much rougher ride!  If you've not already put it in,
here's that subset of the patch - change it around as you wish.


[PATCH] block: adjust blkdev_issue_discard for swap

Three mods to blkdev_issue_discard(), thinking ahead to its use on swap:

1. Add gfp_mask argument, so swap allocation can use it where GFP_KERNEL
   might deadlock but GFP_NOIO is safe.

2. Enlarge nr_sects argument from unsigned to sector_t: unsigned long is
   enough to cover a whole swap area, but sector_t suits any partition.

3. Add an occasional cond_resched() into the loop, to avoid risking bad
   latencies when discarding a large area in small max_hw_sectors steps.

Change sb_issue_discard()'s nr_blocks to sector_t too; but no need seen
for a gfp_mask there, just pass GFP_KERNEL down to blkdev_issue_discard().

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

 block/blk-barrier.c    |   11 ++++++++---
 include/linux/blkdev.h |    9 +++++----
 2 files changed, 13 insertions(+), 7 deletions(-)

--- 2.6.27-rc5-mm1/block/blk-barrier.c	2008-09-05 10:05:33.000000000 +0100
+++ linux/block/blk-barrier.c	2008-09-09 20:00:26.000000000 +0100
@@ -332,15 +332,17 @@ static void blkdev_discard_end_io(struct
  * @bdev:	blockdev to issue discard for
  * @sector:	start sector
  * @nr_sects:	number of sectors to discard
+ * @gfp_mask:	memory allocation flags (for bio_alloc)
  *
  * Description:
  *    Issue a discard request for the sectors in question. Does not wait.
  */
-int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
-			 unsigned nr_sects)
+int blkdev_issue_discard(struct block_device *bdev,
+			 sector_t sector, sector_t nr_sects, gfp_t gfp_mask)
 {
 	struct request_queue *q;
 	struct bio *bio;
+	unsigned int iters = 0;
 	int ret = 0;
 
 	if (bdev->bd_disk == NULL)
@@ -354,7 +356,10 @@ int blkdev_issue_discard(struct block_de
 		return -EOPNOTSUPP;
 
 	while (nr_sects && !ret) {
-		bio = bio_alloc(GFP_KERNEL, 0);
+		if ((++iters & 7) == 0)
+			cond_resched();
+
+		bio = bio_alloc(gfp_mask, 0);
 		if (!bio)
 			return -ENOMEM;
 
--- 2.6.27-rc5-mm1/include/linux/blkdev.h	2008-09-05 10:05:51.000000000 +0100
+++ linux/include/linux/blkdev.h	2008-09-09 20:00:26.000000000 +0100
@@ -16,6 +16,7 @@
 #include <linux/bio.h>
 #include <linux/module.h>
 #include <linux/stringify.h>
+#include <linux/gfp.h>
 #include <linux/bsg.h>
 #include <linux/smp.h>
 
@@ -875,15 +876,15 @@ static inline struct request *blk_map_qu
 }
 
 extern int blkdev_issue_flush(struct block_device *, sector_t *);
-extern int blkdev_issue_discard(struct block_device *, sector_t sector,
-				unsigned nr_sects);
+extern int blkdev_issue_discard(struct block_device *,
+				sector_t sector, sector_t nr_sects, gfp_t);
 
 static inline int sb_issue_discard(struct super_block *sb,
-				   sector_t block, unsigned nr_blocks)
+				   sector_t block, sector_t nr_blocks)
 {
 	block <<= (sb->s_blocksize_bits - 9);
 	nr_blocks <<= (sb->s_blocksize_bits - 9);
-	return blkdev_issue_discard(sb->s_bdev, block, nr_blocks);
+	return blkdev_issue_discard(sb->s_bdev, block, nr_blocks, GFP_KERNEL);
 }
 
 /*


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH] discarding swap
  2008-09-10 19:51   ` Hugh Dickins
@ 2008-09-10 21:28     ` David Woodhouse
  2008-09-12 12:10       ` Hugh Dickins
  2008-09-11  8:58     ` Jens Axboe
  1 sibling, 1 reply; 13+ messages in thread
From: David Woodhouse @ 2008-09-10 21:28 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Jens Axboe, linux-fsdevel, linux-mm

On Wed, 2008-09-10 at 20:51 +0100, Hugh Dickins wrote:
> [PATCH] block: adjust blkdev_issue_discard for swap

blkdev_issue_discard() is for naïve callers who don't want to have to
think about barriers. You might benefit from issuing discard requests
without an implicit softbarrier, for swap.

Of course, you then have to ensure that a discard can't still be
in-flight and actually happen _after_ a subsequent write to that page.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH] discarding swap
  2008-09-10 19:51   ` Hugh Dickins
  2008-09-10 21:28     ` David Woodhouse
@ 2008-09-11  8:58     ` Jens Axboe
  2008-09-11 10:47       ` Hugh Dickins
  1 sibling, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2008-09-11  8:58 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: David Woodhouse, linux-fsdevel, linux-mm

On Wed, Sep 10 2008, Hugh Dickins wrote:
> On Wed, 10 Sep 2008, Jens Axboe wrote:
> > On Tue, Sep 09 2008, Hugh Dickins wrote:
> > 
> > > It seems odd to me that the data-less blkdev_issue_discard() is limited
> > > at all by max_hw_sectors; but I'm guessing there's a good reason, safety
> > > perhaps, which has forced you to that.
> > 
> > The discard request needs to be turned into a hw command at some point,
> > and for that we still need to fit the offset and size in there. So we
> > are still limited by 32MB commands on sata w/lba48, even though we are
> > not moving any data. Suboptimal, but...
> 
> ... makes good sense, thanks.
> 
> > > Here's the proposed patch, or combination of patches: the blkdev and
> > > swap parts should certainly be separated.  Advice welcome - thanks!
> > 
> > I'll snatch up the blk bits and put them in for-2.6.28. OK if I add your
> > SOB to that?
> 
> That would be great.  Thanks a lot for all your comments, I'd been
> expecting a much rougher ride!  If you've not already put it in,
> here's that subset of the patch - change it around as you wish.
> 
> 
> [PATCH] block: adjust blkdev_issue_discard for swap
> 
> Three mods to blkdev_issue_discard(), thinking ahead to its use on swap:
> 
> 1. Add gfp_mask argument, so swap allocation can use it where GFP_KERNEL
>    might deadlock but GFP_NOIO is safe.
> 
> 2. Enlarge nr_sects argument from unsigned to sector_t: unsigned long is
>    enough to cover a whole swap area, but sector_t suits any partition.
> 
> 3. Add an occasional cond_resched() into the loop, to avoid risking bad
>    latencies when discarding a large area in small max_hw_sectors steps.
> 
> Change sb_issue_discard()'s nr_blocks to sector_t too; but no need seen
> for a gfp_mask there, just pass GFP_KERNEL down to blkdev_issue_discard().
> 
> Signed-off-by: Hugh Dickins <hugh@veritas.com>

Hugh, I applied this - but on 2nd thought, I killed the cond_resched()
for two reasons:

- We should only add stuff like that if it's known problematic
- We'll be throttling on the request allocation eventually, once we get
  128 of these in flight.

So if this turns out to be a problem, we can revisit the cond_resched()
solution.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH] discarding swap
  2008-09-11  8:58     ` Jens Axboe
@ 2008-09-11 10:47       ` Hugh Dickins
  0 siblings, 0 replies; 13+ messages in thread
From: Hugh Dickins @ 2008-09-11 10:47 UTC (permalink / raw)
  To: Jens Axboe; +Cc: David Woodhouse, linux-fsdevel, linux-mm

On Thu, 11 Sep 2008, Jens Axboe wrote:
> On Wed, Sep 10 2008, Hugh Dickins wrote:
> > 
> > 3. Add an occasional cond_resched() into the loop, to avoid risking bad
> >    latencies when discarding a large area in small max_hw_sectors steps.
> 
> Hugh, I applied this - but on 2nd thought, I killed the cond_resched()
> for two reasons:

Thanks.  Yes, that was definitely the most dubious part of the patch.

> 
> - We should only add stuff like that if it's known problematic

Fair enough.  I tend to be more proactive than that with mm loops,
and perhaps had it overmuch on my mind because the swap allocation
loop itself used to be such a prime offender.

(There's also the argument that those most worried about such latencies
will be setting CONFIG_PREEMPT=y, in which case no cond_resched() needed:
I like to give that argument some respect, but not take it too far.)

> - We'll be throttling on the request allocation eventually, once we get
>   128 of these in flight.

Yes, my worry was that if the device completes these requests quickly
enough (as we hope it, or many of them, will), blkdev_issue_discard()
may never reach that throttling, despite doing lots more than 128.

> 
> So if this turns out to be a problem, we can revisit the cond_resched()
> solution.

Indeed - and it doesn't affect the blkdev_issue_discard() interface,
just its implementation.

(I'm still mulling over, in between unrelated work,
David's point on the barriers: will reply to that later.)

Hugh


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH] discarding swap
  2008-09-10 21:28     ` David Woodhouse
@ 2008-09-12 12:10       ` Hugh Dickins
  2008-09-12 14:09         ` David Woodhouse
  0 siblings, 1 reply; 13+ messages in thread
From: Hugh Dickins @ 2008-09-12 12:10 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Jens Axboe, linux-fsdevel, linux-mm

On Wed, 10 Sep 2008, David Woodhouse wrote:
> On Wed, 2008-09-10 at 20:51 +0100, Hugh Dickins wrote:
> 
> blkdev_issue_discard() is for naïve callers who don't want to have to
> think about barriers. You might benefit from issuing discard requests
> without an implicit softbarrier, for swap.

Whilst I'd certainly categorize myself as a naïve caller, swap should
not be, and I now believe you're right that it would be better for
swap not to be using DISCARD_BARRIER there - thanks for noticing.

For that I think we'd want blk-barrier.c's blkdev_issue_discard() to
become __blkdev_issue_discard() with a fourth arg (either a boolean,
or DISCARD_BARRIER versus DISCARD_NOBARRIER), with blkdev_issue_discard()
and blkdev_issue_discard_nobarrier() functions inlined in blkdev.h.
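
As a rough illustration of that split - purely a sketch, with the kernel
types (sector_t, gfp_t, struct block_device) stubbed out so that only
the barrier-flag plumbing is visible; none of this is the real
block-layer code:

```c
#include <assert.h>
#include <stdbool.h>

/* Stubs standing in for the real kernel types, for illustration only */
typedef unsigned long long sector_t;
typedef unsigned gfp_t;
struct block_device { int unused; };

/* Records which flavour the core function last saw; a stand-in for
 * actually submitting DISCARD_BARRIER vs DISCARD_NOBARRIER bios. */
static bool last_call_was_barrier;

/* Core worker: the real version would loop, submitting discard bios of
 * at most max_hw_sectors each, marked as barrier or not per the flag. */
static int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
                                  sector_t nr_sects, gfp_t gfp_mask,
                                  bool barrier)
{
    (void)bdev; (void)sector; (void)nr_sects; (void)gfp_mask;
    last_call_was_barrier = barrier;
    return 0;
}

/* The two thin wrappers that would live inline in blkdev.h */
static inline int blkdev_issue_discard(struct block_device *bdev,
                                       sector_t sector, sector_t nr_sects,
                                       gfp_t gfp_mask)
{
    return __blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask, true);
}

static inline int blkdev_issue_discard_nobarrier(struct block_device *bdev,
                                                 sector_t sector,
                                                 sector_t nr_sects,
                                                 gfp_t gfp_mask)
{
    return __blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask, false);
}
```

The point of the shape is that callers like swapfile.c pick barrier
semantics by wrapper name, without ever seeing request_queues or
max_hw_sectors themselves.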

I don't think it would be wise for mm/swapfile.c to duplicate
blkdev_issue_discard() without the _BARRIER: I expect that function
to go through a few changes as experience gathers with devices coming
onstream, changes we'd rather not track in mm; and I don't think mm
(beyond bounce.c) should get into request_queues and max_hw_sectors.

> Of course, you then have to ensure that a discard can't still be
> in-flight and actually happen _after_ a subsequent write to that page.

I was certainly terrified of a write sneaking down before the discard
when it was supposed to come after, and therefore took comfort from
the DISCARD_BARRIER - but was, I think, failing to understand that
it's for a filesystem which needs guarantees on the ordering
between data and metadata *in different areas* of the partition,
not an issue for swap at all.

Looking at what I ended up with, I had to put "wait_for_discard"
serialization in at the swap end anyway; and though I thought at the
time that the _BARRIER was saving me from waiting for completion
(only having to wait for completed submission), now I don't see
the _BARRIER as playing any part at all.

So long as the I/O schedulers guarantee that a WRITE bio submitted
to an area already covered by a DISCARD_NOBARRIER bio cannot pass that
DISCARD_NOBARRIER - where "already" means the submit_bio(WRITE, bio2)
is issued after the submit_bio(DISCARD_NOBARRIER, bio1) has returned
to caller (but its "I/O" of course not necessarily completed).

That seems a reasonable guarantee to me, and perhaps it's trivially
obvious to those who know their I/O schedulers; but I don't, so I'd
like to hear such assurance given.

(If there's a problem giving that assurance for WRITE, but it can be
given for WRITE_SYNC, that would suit me quite nicely too, because I'm
looking for a justification for WRITE_SYNC in swap_writepage(): Jens,
it makes those x86_64-tmpfs-swapping-on-CFQ cases a lot better.)

Hugh

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH] discarding swap
  2008-09-12 12:10       ` Hugh Dickins
@ 2008-09-12 14:09         ` David Woodhouse
  2008-09-12 15:52           ` Hugh Dickins
  0 siblings, 1 reply; 13+ messages in thread
From: David Woodhouse @ 2008-09-12 14:09 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Jens Axboe, linux-fsdevel, linux-mm

On Fri, 2008-09-12 at 13:10 +0100, Hugh Dickins wrote:
> So long as the I/O schedulers guarantee that a WRITE bio submitted
> to an area already covered by a DISCARD_NOBARRIER bio cannot pass that
> DISCARD_NOBARRIER - ...

> That seems a reasonable guarantee to me, and perhaps it's trivially
> obvious to those who know their I/O schedulers; but I don't, so I'd
> like to hear such assurance given.

No, that's the point. The I/O schedulers _don't_ give you that guarantee
at all. They can treat DISCARD_NOBARRIER just like a write. That's all
it is, really -- a special kind of WRITE request without any data.

But -- and this came as a bit of a shock to me -- they don't guarantee
that writes don't cross writes on their queue. If you issue two WRITE
requests to the same sector, you have to make sure for _yourself_ that
there is some kind of barrier between them to keep them in the right
order.

Does swap do that, when a page on the disk is deallocated and then used
for something else?

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH] discarding swap
  2008-09-12 14:09         ` David Woodhouse
@ 2008-09-12 15:52           ` Hugh Dickins
  2008-09-12 16:22             ` David Woodhouse
  2008-09-12 16:50             ` Jamie Lokier
  0 siblings, 2 replies; 13+ messages in thread
From: Hugh Dickins @ 2008-09-12 15:52 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Jens Axboe, linux-fsdevel, linux-mm

On Fri, 12 Sep 2008, David Woodhouse wrote:
> On Fri, 2008-09-12 at 13:10 +0100, Hugh Dickins wrote:
> > So long as the I/O schedulers guarantee that a WRITE bio submitted
> > to an area already covered by a DISCARD_NOBARRIER bio cannot pass that
> > DISCARD_NOBARRIER - ...
> 
> > That seems a reasonable guarantee to me, and perhaps it's trivially
> > obvious to those who know their I/O schedulers; but I don't, so I'd
> > like to hear such assurance given.
> 
> No, that's the point. The I/O schedulers _don't_ give you that guarantee
> at all. They can treat DISCARD_NOBARRIER just like a write. That's all
> it is, really -- a special kind of WRITE request without any data.

Hmmm.  In that case I'll need to continue with DISCARD_BARRIER,
unless/until I rejig swap allocation to wait for discard completion,
which I've no great desire to do.

Is there any particular reason why DISCARD_NOBARRIER shouldn't be
enhanced to give the intuitive guarantee I suggest?  It is distinct
from a WRITE, I don't see why it has to be treated in the same way
if that's unhelpful to its users.

I expect the answer will be: it could be so enhanced, but we really
don't know if it's worth adding special code for that without the
experience of more users.

> 
> But -- and this came as a bit of a shock to me -- they don't guarantee
> that writes don't cross writes on their queue. If you issue two WRITE
> requests to the same sector, you have to make sure for _yourself_ that
> there is some kind of barrier between them to keep them in the right
> order.

Right, I recall from skimming the linux-fsdevel threads that it
emerged that currently WRITEs are depending on page lock for
that serialization, which cannot apply in the discard case.

So, there's been no need for such a guarantee in the WRITE case;
but it sure would be helpful in the DISCARD case, which has no
pages to lock anyway.

> 
> Does swap do that, when a page on the disk is deallocated and then used
> for something else?

Yes, that's managed through the PageWriteback flag: there are various
places where we'd like to free up swap, but cannot do so because it's
still attached to a cached page with PageWriteback set; in which case
its freeing has to be left until vmscan.c finds PageWriteback cleared,
then removes page from swapcache and frees the swap.
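
A toy model of that rule - these are illustrative flags standing in for
the kernel's page bits, not actual mm code:

```c
#include <assert.h>
#include <stdbool.h>

/* Models the ordering described above: a swap slot attached to a
 * swapcache page cannot be freed while that page still has writeback
 * in flight; only once writeback clears may the page leave the
 * swapcache and its slot become free for reuse. */
struct swapcache_page {
    bool writeback;     /* stands in for PageWriteback */
    bool in_swapcache;  /* stands in for swap cache membership */
};

/* Returns true if the swap slot could be freed now */
static bool try_free_swap(struct swapcache_page *page)
{
    if (page->writeback)
        return false;   /* must wait for writeback to complete */
    page->in_swapcache = false;
    return true;
}
```

So by the time a slot can be reallocated and rewritten, the previous
write to it is guaranteed to have completed.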

Hugh


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH] discarding swap
  2008-09-12 15:52           ` Hugh Dickins
@ 2008-09-12 16:22             ` David Woodhouse
  2008-09-12 17:46               ` Hugh Dickins
  2008-09-12 16:50             ` Jamie Lokier
  1 sibling, 1 reply; 13+ messages in thread
From: David Woodhouse @ 2008-09-12 16:22 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Jens Axboe, linux-fsdevel, linux-mm

On Fri, 2008-09-12 at 16:52 +0100, Hugh Dickins wrote:
> On Fri, 12 Sep 2008, David Woodhouse wrote:
> > On Fri, 2008-09-12 at 13:10 +0100, Hugh Dickins wrote:
> > > So long as the I/O schedulers guarantee that a WRITE bio submitted
> > > to an area already covered by a DISCARD_NOBARRIER bio cannot pass that
> > > DISCARD_NOBARRIER - ...
> > 
> > > That seems a reasonable guarantee to me, and perhaps it's trivially
> > > obvious to those who know their I/O schedulers; but I don't, so I'd
> > > like to hear such assurance given.
> > 
> > No, that's the point. The I/O schedulers _don't_ give you that guarantee
> > at all. They can treat DISCARD_NOBARRIER just like a write. That's all
> > it is, really -- a special kind of WRITE request without any data.
> 
> Hmmm.  In that case I'll need to continue with DISCARD_BARRIER,
> unless/until I rejig swap allocation to wait for discard completion,
> which I've no great desire to do.
> 
> Is there any particular reason why DISCARD_NOBARRIER shouldn't be
> enhanced to give the intuitive guarantee I suggest?  It is distinct
> from a WRITE, I don't see why it has to be treated in the same way
> if that's unhelpful to its users.

The semantics we want would be something like "when a WRITE or DISCARD
request is submitted, automatically turn it into a soft barrier if there
is already an outstanding WRITE or DISCARD request overlapping the same
sectors".

Detecting overlap isn't hard in the single-queue case, but things like
CFQ make it interesting -- you'd have to search _every_ queue. And you
couldn't just do it when inserting barriers -- you need a write to gain
the barrier flag, if it's inserted after a discard. So we really do care
about the performance.
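
The overlap test itself is the cheap part; a minimal sketch, with
half-open sector ranges and illustrative names rather than block-layer
API:

```c
#include <assert.h>

typedef unsigned long long sector_t;

/* A queued request covering sectors [sector, sector + nr_sects) */
struct pending_rq {
    sector_t sector;    /* start sector */
    sector_t nr_sects;  /* length in sectors */
};

/* Two half-open ranges overlap iff each starts before the other ends */
static int rq_overlaps(const struct pending_rq *a, const struct pending_rq *b)
{
    return a->sector < b->sector + b->nr_sects &&
           b->sector < a->sector + a->nr_sects;
}
```

The expense is in where this has to run: against every request on every
queue, on each insertion.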

I agree it would be nice to have if we can do it cheaply enough, though.

> I expect the answer will be: it could be so enhanced, but we really
> don't know if it's worth adding special code for that without the
> experience of more users.

That too. We don't yet really know how much the DISCARD requests buy us
in terms of performance or device lifetime. It'll depend a lot on the
internals of the devices, and we don't get told a lot about that.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH] discarding swap
  2008-09-12 15:52           ` Hugh Dickins
  2008-09-12 16:22             ` David Woodhouse
@ 2008-09-12 16:50             ` Jamie Lokier
  2008-09-12 17:25               ` Hugh Dickins
  1 sibling, 1 reply; 13+ messages in thread
From: Jamie Lokier @ 2008-09-12 16:50 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: David Woodhouse, Jens Axboe, linux-fsdevel, linux-mm

Hugh Dickins wrote:
> > Does swap do that, when a page on the disk is deallocated and then used
> > for something else?
> 
> Yes, that's managed through the PageWriteback flag: there are various
> places where we'd like to free up swap, but cannot do so because it's
> still attached to a cached page with PageWriteback set; in which case
> its freeing has to be left until vmscan.c finds PageWriteback cleared,
> then removes page from swapcache and frees the swap.

Here's an idea which is prompted by DISCARD:

One thing the request layer doesn't do is cancellations.
But if it did:

If you schedule some swap to be written, then later it is no longer
required before the WRITE has completed (e.g. process exits), on a
busy system would it be worth _cancelling_ the WRITE while it's still
in the request queue?  This is quite similar to DISCARDing, but
internal to the kernel.

(Many userspace AIO interfaces do have cancellations, perhaps this is why).

-- Jamie


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH] discarding swap
  2008-09-12 16:50             ` Jamie Lokier
@ 2008-09-12 17:25               ` Hugh Dickins
  0 siblings, 0 replies; 13+ messages in thread
From: Hugh Dickins @ 2008-09-12 17:25 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: David Woodhouse, Jens Axboe, linux-fsdevel, linux-mm

On Fri, 12 Sep 2008, Jamie Lokier wrote:
> 
> Here's an idea which is prompted by DISCARD:
> 
> One thing the request layer doesn't do is cancellations.
> But if it did:
> 
> If you schedule some swap to be written, then later it is no longer
> required before the WRITE has completed (e.g. process exits), on a
> busy system would it be worth _cancelling_ the WRITE while it's still
> in the request queue?  This is quite similar to DISCARDing, but
> internal to the kernel.

You mean, like those "So-and-so wishes to recall the embarrassing
message accidentally sent to everyone in the company" notices which I
sometimes see from MS users?

Yes, it could be applicable when there's a huge quantity of I/O in
flight that suddenly becomes redundant, on process exit (for swap)
or file truncation.  But is the upper level likely to want to submit
bios for all such pages?  And it only works so long as the bio has
not yet gone out for I/O - therefore seems of limited usefulness?

But might come pretty much for free if it were decided that DISCARD
does need more complicated detect-if-writes-already-queued semantics.

Hugh


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH] discarding swap
  2008-09-12 16:22             ` David Woodhouse
@ 2008-09-12 17:46               ` Hugh Dickins
  0 siblings, 0 replies; 13+ messages in thread
From: Hugh Dickins @ 2008-09-12 17:46 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Jens Axboe, linux-fsdevel, linux-mm

On Fri, 12 Sep 2008, David Woodhouse wrote:
> On Fri, 2008-09-12 at 16:52 +0100, Hugh Dickins wrote:
> > On Fri, 12 Sep 2008, David Woodhouse wrote:
> > > On Fri, 2008-09-12 at 13:10 +0100, Hugh Dickins wrote:
> > > > So long as the I/O schedulers guarantee that a WRITE bio submitted
> > > > to an area already covered by a DISCARD_NOBARRIER bio cannot pass that
> > > > DISCARD_NOBARRIER - ...
> > > 
> > > No, that's the point. The I/O schedulers _don't_ give you that guarantee
> > > at all. They can treat DISCARD_NOBARRIER just like a write. That's all
> > > it is, really -- a special kind of WRITE request without any data.
> > 
> > Hmmm.  In that case I'll need to continue with DISCARD_BARRIER,
> > unless/until I rejig swap allocation to wait for discard completion,
> > which I've no great desire to do.

I'll leave it to Jens to comment on your reply, but I'd like to go
back and add in a further, orthogonal concern or misunderstanding here.

Am I right to be a little perturbed by blk_partition_remap()
and the particular stage at which it's called?

Does it imply that a _BARRIER on swap would have the effect of
inserting a barrier into, say, root and home I/Os too, if swap
and root and home were in separate partitions on the same storage?

Whereas a filesystem would logically only want a barrier to span
its own partition?  (I'm ignoring md/dm.)

Hugh


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2008-09-12 17:46 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-09-09 21:28 [RFC PATCH] discarding swap Hugh Dickins
2008-09-10 17:35 ` Jens Axboe
2008-09-10 19:51   ` Hugh Dickins
2008-09-10 21:28     ` David Woodhouse
2008-09-12 12:10       ` Hugh Dickins
2008-09-12 14:09         ` David Woodhouse
2008-09-12 15:52           ` Hugh Dickins
2008-09-12 16:22             ` David Woodhouse
2008-09-12 17:46               ` Hugh Dickins
2008-09-12 16:50             ` Jamie Lokier
2008-09-12 17:25               ` Hugh Dickins
2008-09-11  8:58     ` Jens Axboe
2008-09-11 10:47       ` Hugh Dickins
