Re: [patch 1/4 v6]swap: change block allocation algorithm for SSD

From: Shaohua Li <shli@kernel.org>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org, riel@redhat.com, minchan@kernel.org,
	kmpark@infradead.org, hughd@google.com, aquini@redhat.com
Subject: Re: [patch 1/4 v6]swap: change block allocation algorithm for SSD
Date: Thu, 18 Jul 2013 18:33:10 +0800
Message-ID: <20130718103310.GA25547@kernel.org>
In-Reply-To: <20130717150007.ff10504603266dc221763315@linux-foundation.org>

On Wed, Jul 17, 2013 at 03:00:07PM -0700, Andrew Morton wrote:
> On Tue, 16 Jul 2013 04:43:20 +0800 Shaohua Li <shli@kernel.org> wrote:
> 
> > I'm using a fast SSD to do swap. scan_swap_map() sometimes uses up to 20~30%
> > CPU time (when cluster is hard to find, the CPU time can be up to 80%), which
> > becomes a bottleneck.  scan_swap_map() scans a byte array to search a 256 page
> > cluster, which is very slow.
> > 
> > Here I introduce a simple algorithm to search for a cluster. Since we only
> > care about 256-page clusters, we can just use a counter to track whether a
> > cluster is free. Every 256 pages use one int to store the counter. If the
> > counter of a cluster is 0, the cluster is free. All free clusters are added
> > to a list, so searching for a cluster is very efficient. With this, the
> > scan_swap_map() overhead disappears.
> > 
> > This might help low end SD card swap too, because if the cluster is aligned,
> > SD firmware can do flash erase more efficiently.
> > 
> > We only enable the algorithm for SSD. Hard disk swap isn't fast enough and has
> > downside with the algorithm which might introduce regression (see below).
> > 
> > The patch slightly changes which cluster is chosen. It always adds free
> > clusters to the list tail. This can help wear leveling for low end SSD too.
> > And if no cluster is found, scan_swap_map() will search from the end of the
> > last free cluster, which is random. For SSD, this isn't a problem at all.
> > 
> > Another downside is the cluster must be aligned to 256 pages, which will reduce
> > the chance to find a cluster. I would expect this isn't a big problem for SSD
> > because there is no seek penalty. (And this is the reason I only enable the
> > algorithm for SSD).
> 
> I have to agree with Will here - the patch adds a significant new
> design/algorithm into core MM but there wasn't even an attempt to
> describe it within the code.
> 
> The changelog provides a reasonable overview, most notably the second
> paragraph.  Could you please find a way to flesh that part out a bit
> then integrate it into a code comment?  And yes, the major functions
> should have their own comments explaining how they serve the overall
> scheme.

Alright, I'll add as much documentation as possible to the code, rather than
leaving it only in the changelog.
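
To make the scheme from the changelog concrete, here is a minimal user-space
sketch of the per-cluster counter and the free list. It is not the code from
the patch: struct swap_cluster_info, SWAPFILE_CLUSTER and the CLUSTER_FLAG_*
values are taken from the quoted diff, while struct cluster_table, NO_CLUSTER
and all function names are hypothetical and exist only for illustration.
Locking (swap_info_struct.lock in the kernel) is left out.

```c
#include <assert.h>
#include <stdlib.h>

#define SWAPFILE_CLUSTER	256	/* pages per cluster */
#define CLUSTER_FLAG_FREE	1	/* This cluster is free */
#define CLUSTER_FLAG_NEXT_NULL	2	/* This cluster has no next cluster */
#define NO_CLUSTER		(-1)	/* hypothetical "empty list" marker */

/* data holds the next free cluster if FREE is set, else the usage counter */
struct swap_cluster_info {
	unsigned int data:24;
	unsigned int flags:8;
};

/* Hypothetical container; the kernel keeps this state in swap_info_struct */
struct cluster_table {
	struct swap_cluster_info *info;
	unsigned int nr;
	int head;	/* first free cluster, or NO_CLUSTER */
	int tail;	/* last free cluster, or NO_CLUSTER */
};

/* Free clusters always go to the list tail */
static void free_list_add_tail(struct cluster_table *t, unsigned int idx)
{
	assert(idx < t->nr);
	t->info[idx].flags = CLUSTER_FLAG_FREE | CLUSTER_FLAG_NEXT_NULL;
	t->info[idx].data = 0;
	if (t->head == NO_CLUSTER) {
		t->head = (int)idx;
	} else {
		/* old tail now points at idx and loses NEXT_NULL */
		t->info[t->tail].data = idx;
		t->info[t->tail].flags = CLUSTER_FLAG_FREE;
	}
	t->tail = (int)idx;
}

/* O(1) replacement for scanning the byte array: take the list head */
static int free_list_pop_head(struct cluster_table *t)
{
	int idx = t->head;

	if (idx == NO_CLUSTER)
		return NO_CLUSTER;
	if (t->info[idx].flags & CLUSTER_FLAG_NEXT_NULL)
		t->head = t->tail = NO_CLUSTER;
	else
		t->head = (int)t->info[idx].data;
	t->info[idx].flags = 0;
	t->info[idx].data = 0;	/* from here on, data is the usage counter */
	return idx;
}

/* A page in an allocated cluster is now in use */
static void page_get(struct cluster_table *t, unsigned long page)
{
	unsigned long idx = page / SWAPFILE_CLUSTER;

	assert(idx < t->nr && !(t->info[idx].flags & CLUSTER_FLAG_FREE));
	t->info[idx].data++;
}

/* A page was released; a counter of 0 means the whole cluster is free */
static void page_put(struct cluster_table *t, unsigned long page)
{
	unsigned long idx = page / SWAPFILE_CLUSTER;

	assert(idx < t->nr && !(t->info[idx].flags & CLUSTER_FLAG_FREE));
	assert(t->info[idx].data > 0);
	t->info[idx].data--;
	if (t->info[idx].data == 0)
		free_list_add_tail(t, (unsigned int)idx);
}

/* Start with every cluster on the free list, in index order */
static int table_init(struct cluster_table *t, unsigned int nr)
{
	t->info = calloc(nr, sizeof(*t->info));
	if (!t->info)
		return -1;
	t->nr = nr;
	t->head = t->tail = NO_CLUSTER;
	for (unsigned int i = 0; i < nr; i++)
		free_list_add_tail(t, i);
	return 0;
}
```

In this sketch a caller would pop a cluster with free_list_pop_head(), call
page_get() for each page it hands out, and page_put() when a page is released;
the cluster goes back to the list tail once its counter drops to 0.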
 
> > --- linux.orig/include/linux/swap.h	2013-07-11 19:14:36.849910383 +0800
> > +++ linux/include/linux/swap.h	2013-07-11 19:14:38.657887654 +0800
> > @@ -182,6 +182,17 @@ enum {
> >  #define SWAP_MAP_SHMEM	0xbf	/* Owned by shmem/tmpfs, in first swap_map */
> >  
> >  /*
> > + * the data field stores next cluster if the cluster is free or cluster counter
> > + * otherwise
> > + */
> > +struct swap_cluster_info {
> > +	unsigned int data:24;
> > +	unsigned int flags:8;
> > +};
> 
> If I'm understanding it correctly, the code and data structures which
> this patch adds are all protected by swap_info_struct.lock, yes?  This
> is also worth mentioning in a comment, perhaps at the swap_cluster_info
> definition site.
> 
> > +#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
> > +#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
> >
> > ...
> >
> > @@ -2117,13 +2311,28 @@ SYSCALL_DEFINE2(swapon, const char __use
> >  		error = -ENOMEM;
> >  		goto bad_swap;
> >  	}
> > +	if (p->bdev && blk_queue_nonrot(bdev_get_queue(p->bdev))) {
> > +		p->flags |= SWP_SOLIDSTATE;
> > +		/*
> > +		 * select a random position to start with to help wear leveling
> > +		 * SSD
> > +		 */
> > +		p->cluster_next = 1 + (prandom_u32() % p->highest_bit);
> > +
> > +		cluster_info = vzalloc(DIV_ROUND_UP(maxpages,
> > +			SWAPFILE_CLUSTER) * sizeof(*cluster_info));
> 
> OK, what is the upper bound on the size of this allocation?
> 
> A failure here would be bad - perhaps a list is needed, rather than a
> flat array.

Not much. The cluster_info takes one int per 256 pages, so for a 1T swap
partition we will use 4M of memory (with 4K pages). A list would waste memory
and be hard to use in this case, because we need to look up the cluster_info
by page index.
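
The arithmetic behind that figure can be checked with a small user-space
sketch. It assumes 4K pages and a 4-byte struct swap_cluster_info (true for
the usual compilers, but bit-field layout is implementation-defined);
PAGE_SIZE_BYTES, div_round_up() and cluster_info_bytes() are hypothetical
names, not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

#define SWAPFILE_CLUSTER	256u	/* pages per cluster */
#define PAGE_SIZE_BYTES		4096u	/* assumption: 4K pages */

struct swap_cluster_info {
	unsigned int data:24;
	unsigned int flags:8;
};

/* Assumption: both bit-fields pack into a single unsigned int */
_Static_assert(sizeof(struct swap_cluster_info) == sizeof(unsigned int),
	       "swap_cluster_info is expected to fit in one int");

/* User-space stand-in for the kernel's DIV_ROUND_UP() */
static uint64_t div_round_up(uint64_t n, uint64_t d)
{
	return (n + d - 1) / d;
}

/* Size of the flat cluster_info array for a swap area of swap_bytes */
static uint64_t cluster_info_bytes(uint64_t swap_bytes)
{
	uint64_t maxpages = swap_bytes / PAGE_SIZE_BYTES;

	return div_round_up(maxpages, SWAPFILE_CLUSTER) *
	       sizeof(struct swap_cluster_info);
}
```

For 1T of swap this gives 2^40 / 2^12 = 2^28 pages, 2^28 / 256 = 2^20
clusters, and 2^20 * 4 bytes = 4M.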

Thanks,
Shaohua 
