Re: [PATCH 0/5] make slab gfp fair

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Peter Zijlstra <a.p.zijlstra@chello.nl>
To: Christoph Lameter <clameter@sgi.com>
Cc: Matt Mackall <mpm@selenic.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Thomas Graf <tgraf@suug.ch>, David Miller <davem@davemloft.net>,
	Andrew Morton <akpm@linux-foundation.org>,
	Daniel Phillips <phillips@google.com>,
	Pekka Enberg <penberg@cs.helsinki.fi>
Subject: Re: [PATCH 0/5] make slab gfp fair
Date: Wed, 16 May 2007 22:18:58 +0200	[thread overview]
Message-ID: <1179346738.2912.39.camel@lappy> (raw)
In-Reply-To: <Pine.LNX.4.64.0705161235490.10660@schroedinger.engr.sgi.com>

On Wed, 2007-05-16 at 12:53 -0700, Christoph Lameter wrote:
> On Wed, 16 May 2007, Peter Zijlstra wrote:
> 
> > If this 4k cpu system ever gets to touch the new lock it is in way
> > deeper problems than a bouncing cache-line.
> 
> So its no use on NUMA?

It is, its just that we're swapping very heavily at that point, a
bouncing cache-line will not significantly slow down the box compared to
waiting for block IO, will it?

> > Please look at it more carefully.
> > 
> > We differentiate pages allocated at the level where GFP_ATOMIC starts to
> > fail. By not updating the percpu slabs those are retried every time,
> > except for ALLOC_NO_WATERMARKS allocations; those are served from the
> > ->reserve_slab.
> > 
> > Once a regular slab allocation succeeds again, the ->reserve_slab is
> > cleaned up and never again looked at it until we're in distress again.
> 
> A single slab? This may only give you a a single object in an extreme 
> case. Are you sure that this solution is generic enough?

Well, single as in a single active; it gets spilled into the full list
and a new one is instanciated if more is needed.

> The problem here is that you may spinlock and take out the slab for one 
> cpu but then (AFAICT) other cpus can still not get their high priority 
> allocs satisfied. Some comments follow.

All cpus are redirected to ->reserve_slab when the regular allocations
start to fail.

> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > ---
> >  include/linux/slub_def.h |    2 +
> >  mm/slub.c                |   85 ++++++++++++++++++++++++++++++++++++++++++-----
> >  2 files changed, 78 insertions(+), 9 deletions(-)
> > 
> > Index: linux-2.6-git/include/linux/slub_def.h
> > ===================================================================
> > --- linux-2.6-git.orig/include/linux/slub_def.h
> > +++ linux-2.6-git/include/linux/slub_def.h
> > @@ -46,6 +46,8 @@ struct kmem_cache {
> >  	struct list_head list;	/* List of slab caches */
> >  	struct kobject kobj;	/* For sysfs */
> >  
> > +	struct page *reserve_slab;
> > +
> >  #ifdef CONFIG_NUMA
> >  	int defrag_ratio;
> >  	struct kmem_cache_node *node[MAX_NUMNODES];
> > Index: linux-2.6-git/mm/slub.c
> > ===================================================================
> > --- linux-2.6-git.orig/mm/slub.c
> > +++ linux-2.6-git/mm/slub.c
> > @@ -20,11 +20,13 @@
> >  #include <linux/mempolicy.h>
> >  #include <linux/ctype.h>
> >  #include <linux/kallsyms.h>
> > +#include "internal.h"
> >  
> >  /*
> >   * Lock order:
> > - *   1. slab_lock(page)
> > - *   2. slab->list_lock
> > + *   1. reserve_lock
> > + *   2. slab_lock(page)
> > + *   3. node->list_lock
> >   *
> >   *   The slab_lock protects operations on the object of a particular
> >   *   slab and its metadata in the page struct. If the slab lock
> > @@ -259,6 +261,8 @@ static int sysfs_slab_alias(struct kmem_
> >  static void sysfs_slab_remove(struct kmem_cache *s) {}
> >  #endif
> >  
> > +static DEFINE_SPINLOCK(reserve_lock);
> > +
> >  /********************************************************************
> >   * 			Core slab cache functions
> >   *******************************************************************/
> > @@ -1007,7 +1011,7 @@ static void setup_object(struct kmem_cac
> >  		s->ctor(object, s, 0);
> >  }
> >  
> > -static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
> > +static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node, int *rank)
> >  {
> >  	struct page *page;
> >  	struct kmem_cache_node *n;
> > @@ -1025,6 +1029,7 @@ static struct page *new_slab(struct kmem
> >  	if (!page)
> >  		goto out;
> >  
> > +	*rank = page->rank;
> >  	n = get_node(s, page_to_nid(page));
> >  	if (n)
> >  		atomic_long_inc(&n->nr_slabs);
> > @@ -1311,7 +1316,7 @@ static void unfreeze_slab(struct kmem_ca
> >  /*
> >   * Remove the cpu slab
> >   */
> > -static void deactivate_slab(struct kmem_cache *s, struct page *page, int cpu)
> > +static void __deactivate_slab(struct kmem_cache *s, struct page *page)
> >  {
> >  	/*
> >  	 * Merge cpu freelist into freelist. Typically we get here
> > @@ -1330,10 +1335,15 @@ static void deactivate_slab(struct kmem_
> >  		page->freelist = object;
> >  		page->inuse--;
> >  	}
> > -	s->cpu_slab[cpu] = NULL;
> >  	unfreeze_slab(s, page);
> >  }
> 
> So you want to spill back the lockless_freelist without deactivating the 
> slab? Why are you using the lockless_freelist at all? If you do not use it 
> then you can call unfreeze_slab. No need for this split.

Ah, quite right. I do indeed not use the lockless_freelist.

> > @@ -1395,6 +1405,7 @@ static void *__slab_alloc(struct kmem_ca
> >  {
> >  	void **object;
> >  	int cpu = smp_processor_id();
> > +	int rank = 0;
> >  
> >  	if (!page)
> >  		goto new_slab;
> > @@ -1424,10 +1435,26 @@ new_slab:
> >  	if (page) {
> >  		s->cpu_slab[cpu] = page;
> >  		goto load_freelist;
> > -	}
> > +	} else if (unlikely(gfp_to_alloc_flags(gfpflags) & ALLOC_NO_WATERMARKS))
> > +		goto try_reserve;
> 
> Ok so we are trying to allocate a slab and do not get one thus -> 
> try_reserve. 

Right, so the cpu-slab is NULL, and we need a new slab.

> But this is only working if we are using the slab after
> explicitly flushing the cpuslabs. Otherwise the slab may be full and we
> get to alloc_slab.

/me fails to parse.

When we need a new_slab: 
 - we try the partial lists,
 - we try the reserve (if ALLOC_NO_WATERMARKS)
   otherwise alloc_slab

> >  
> > -	page = new_slab(s, gfpflags, node);
> > -	if (page) {
> 
> > +alloc_slab:
> > +	page = new_slab(s, gfpflags, node, &rank);
> > +	if (page && rank) {
> 
> Huh? You mean !page?

No, no, we did get a page, and it was !ALLOC_NO_WATERMARK hard to get
it. 

> > +		if (unlikely(s->reserve_slab)) {
> > +			struct page *reserve;
> > +
> > +			spin_lock(&reserve_lock);
> > +			reserve = s->reserve_slab;
> > +			s->reserve_slab = NULL;
> > +			spin_unlock(&reserve_lock);
> > +
> > +			if (reserve) {
> > +				slab_lock(reserve);
> > +				__deactivate_slab(s, reserve);
> > +				putback_slab(s, reserve);
> 
> Remove the above two lines (they are wrong regardless) and simply make 
> this the cpu slab.

It need not be the same node; the reserve_slab is node agnostic.
So here the free page watermarks are good again, and we can forget all
about the ->reserve_slab. We just push it on the free/partial lists and
forget about it.

But like you said above: unfreeze_slab() should be good, since I don't
use the lockless_freelist.

> > +			}
> > +		}
> >  		cpu = smp_processor_id();
> >  		if (s->cpu_slab[cpu]) {
> >  			/*
> > @@ -1455,6 +1482,18 @@ new_slab:
> >  		SetSlabFrozen(page);
> >  		s->cpu_slab[cpu] = page;
> >  		goto load_freelist;
> > +	} else if (page) {
> > +		spin_lock(&reserve_lock);
> > +		if (s->reserve_slab) {
> > +			discard_slab(s, page);
> > +			page = s->reserve_slab;
> > +		}
> > +		slab_lock(page);
> > +		SetPageActive(page);
> > +		s->reserve_slab = page;
> > +		spin_unlock(&reserve_lock);
> > +
> > +		goto got_reserve;

So this is when we get a page and it was ALLOC_NO_WATERMARKS hard to get
it. Instead of updating the cpu_slab we leave that unset, so that
subsequent allocations will try to allocate a slab again thereby testing
the current free pages limit (and not gobble up the reserve).

> >  	}
> >  	return NULL;
> >  debug:
> > @@ -1470,6 +1509,31 @@ debug:
> >  	page->freelist = object[page->offset];
> >  	slab_unlock(page);
> >  	return object;
> > +
> > +try_reserve:
> > +	spin_lock(&reserve_lock);
> > +	page = s->reserve_slab;
> > +	if (!page) {
> > +		spin_unlock(&reserve_lock);
> > +		goto alloc_slab;
> > +	}
> > +
> > +	slab_lock(page);
> > +	if (!page->freelist) {
> > +		s->reserve_slab = NULL;
> > +		spin_unlock(&reserve_lock);
> > +		__deactivate_slab(s, page);
> replace with unfreeze slab.
> 
> > +		putback_slab(s, page);
> 
> Putting back the slab twice.

__deactivete_slab() doesn't do putback_slab, and now I see that whole
function isn't there anymore. unfreeze_slab() it is.

> > +		goto alloc_slab;
> > +	}
> > +	spin_unlock(&reserve_lock);
> > +
> > +got_reserve:
> > +	object = page->freelist;
> > +	page->inuse++;
> > +	page->freelist = object[page->offset];
> > +	slab_unlock(page);
> > +	return object;



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2007-05-16 20:18 UTC|newest]

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-05-14 13:19 Peter Zijlstra
2007-05-14 13:19 ` [PATCH 1/5] mm: page allocation rank Peter Zijlstra
2007-05-14 13:19 ` [PATCH 2/5] mm: slab allocation fairness Peter Zijlstra
2007-05-14 15:51   ` Christoph Lameter
2007-05-14 13:19 ` [PATCH 3/5] mm: slub " Peter Zijlstra
2007-05-14 15:49   ` Christoph Lameter
2007-05-14 16:14     ` Peter Zijlstra
2007-05-14 16:35       ` Christoph Lameter
2007-05-14 13:19 ` [PATCH 4/5] mm: slob " Peter Zijlstra
2007-05-14 13:19 ` [PATCH 5/5] mm: allow mempool to fall back to memalloc reserves Peter Zijlstra
2007-05-14 15:53 ` [PATCH 0/5] make slab gfp fair Christoph Lameter
2007-05-14 16:10   ` Peter Zijlstra
2007-05-14 16:37     ` Christoph Lameter
2007-05-14 16:12   ` Matt Mackall
2007-05-14 16:29     ` Christoph Lameter
2007-05-14 17:40       ` Peter Zijlstra
2007-05-14 17:57         ` Christoph Lameter
2007-05-14 19:28           ` Peter Zijlstra
2007-05-14 19:56             ` Christoph Lameter
2007-05-14 20:03               ` Peter Zijlstra
2007-05-14 20:06                 ` Christoph Lameter
2007-05-14 20:12                   ` Peter Zijlstra
2007-05-14 20:25                 ` Christoph Lameter
2007-05-15 17:27             ` Peter Zijlstra
2007-05-15 22:02               ` Christoph Lameter
2007-05-16  6:59                 ` Peter Zijlstra
2007-05-16 18:43                   ` Christoph Lameter
2007-05-16 19:25                     ` Peter Zijlstra
2007-05-16 19:53                       ` Christoph Lameter
2007-05-16 20:18                         ` Peter Zijlstra [this message]
2007-05-16 20:27                           ` Christoph Lameter
2007-05-16 20:40                             ` Peter Zijlstra
2007-05-16 20:44                               ` Christoph Lameter
2007-05-16 20:54                                 ` Peter Zijlstra
2007-05-16 20:59                                   ` Christoph Lameter
2007-05-16 21:04                                     ` Peter Zijlstra
2007-05-16 21:13                                       ` Christoph Lameter
2007-05-16 21:20                                         ` Peter Zijlstra
2007-05-16 21:42                                           ` Christoph Lameter
2007-05-17  7:28                                             ` Peter Zijlstra
2007-05-17 17:30                                               ` Christoph Lameter
2007-05-17 17:53                                                 ` Peter Zijlstra
2007-05-17 18:01                                                   ` Christoph Lameter
2007-05-14 19:44     ` Andrew Morton
2007-05-14 20:01       ` Matt Mackall
2007-05-14 20:05       ` Peter Zijlstra
2007-05-17  3:02 ` Christoph Lameter
2007-05-17  7:08   ` Peter Zijlstra
2007-05-17 17:29     ` Christoph Lameter
2007-05-17 17:52       ` Peter Zijlstra
2007-05-17 17:59         ` Christoph Lameter
2007-05-17 17:53       ` Matt Mackall
2007-05-17 18:02         ` Christoph Lameter
2007-05-17 19:18           ` Peter Zijlstra
2007-05-17 19:24             ` Christoph Lameter
2007-05-17 21:26               ` Peter Zijlstra
2007-05-17 21:44                 ` Paul Jackson
2007-05-17 22:27                 ` Christoph Lameter
2007-05-18  9:54                   ` Peter Zijlstra
2007-05-18 17:11                     ` Paul Jackson
2007-05-18 17:11                     ` Christoph Lameter
2007-05-20  8:39                       ` Peter Zijlstra
2007-05-21 16:45                         ` Christoph Lameter
2007-05-21 19:33                           ` Peter Zijlstra
2007-05-21 19:43                             ` Christoph Lameter
2007-05-21 20:08                               ` Peter Zijlstra
2007-05-21 20:32                                 ` Christoph Lameter
2007-05-21 20:54                                   ` Peter Zijlstra
2007-05-21 21:04                                     ` Christoph Lameter

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1179346738.2912.39.camel@lappy \
    --to=a.p.zijlstra@chello.nl \
    --cc=akpm@linux-foundation.org \
    --cc=clameter@sgi.com \
    --cc=davem@davemloft.net \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mpm@selenic.com \
    --cc=penberg@cs.helsinki.fi \
    --cc=phillips@google.com \
    --cc=tgraf@suug.ch \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox