linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: David Rientjes <rientjes@google.com>
To: Nick Piggin <npiggin@suse.de>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>,
	Andi Kleen <andi@firstfloor.org>,
	Christoph Lameter <cl@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	haicheng.li@intel.com,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Subject: Re: [patch] slab: add memory hotplug support
Date: Sat, 27 Mar 2010 19:13:44 -0700 (PDT)	[thread overview]
Message-ID: <alpine.DEB.2.00.1003271849260.7249@chino.kir.corp.google.com> (raw)
In-Reply-To: <20100309134633.GM8653@laptop>

On Wed, 10 Mar 2010, Nick Piggin wrote:

> On Mon, Mar 08, 2010 at 03:19:48PM -0800, David Rientjes wrote:
> > On Fri, 5 Mar 2010, Nick Piggin wrote:
> > 
> > > > +#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
> > > > +/*
> > > > + * Drains and frees nodelists for a node on each slab cache, used for memory
> > > > + * hotplug.  Returns -EBUSY if all objects cannot be drained on memory
> > > > + * hot-remove so that the node is not removed.  When used because memory
> > > > + * hot-add is canceled, the only result is the freed kmem_list3.
> > > > + *
> > > > + * Must hold cache_chain_mutex.
> > > > + */
> > > > +static int __meminit free_cache_nodelists_node(int node)
> > > > +{
> > > > +	struct kmem_cache *cachep;
> > > > +	int ret = 0;
> > > > +
> > > > +	list_for_each_entry(cachep, &cache_chain, next) {
> > > > +		struct array_cache *shared;
> > > > +		struct array_cache **alien;
> > > > +		struct kmem_list3 *l3;
> > > > +
> > > > +		l3 = cachep->nodelists[node];
> > > > +		if (!l3)
> > > > +			continue;
> > > > +
> > > > +		spin_lock_irq(&l3->list_lock);
> > > > +		shared = l3->shared;
> > > > +		if (shared) {
> > > > +			free_block(cachep, shared->entry, shared->avail, node);
> > > > +			l3->shared = NULL;
> > > > +		}
> > > > +		alien = l3->alien;
> > > > +		l3->alien = NULL;
> > > > +		spin_unlock_irq(&l3->list_lock);
> > > > +
> > > > +		if (alien) {
> > > > +			drain_alien_cache(cachep, alien);
> > > > +			free_alien_cache(alien);
> > > > +		}
> > > > +		kfree(shared);
> > > > +
> > > > +		drain_freelist(cachep, l3, l3->free_objects);
> > > > +		if (!list_empty(&l3->slabs_full) ||
> > > > +					!list_empty(&l3->slabs_partial)) {
> > > > +			/*
> > > > +			 * Continue to iterate through each slab cache to free
> > > > +			 * as many nodelists as possible even though the
> > > > +			 * offline will be canceled.
> > > > +			 */
> > > > +			ret = -EBUSY;
> > > > +			continue;
> > > > +		}
> > > > +		kfree(l3);
> > > > +		cachep->nodelists[node] = NULL;
> > > 
> > > What's stopping races of other CPUs trying to access l3 and array
> > > caches while they're being freed?
> > > 
> > 
> > numa_node_id() will not return an offlined nodeid and cache_alloc_node() 
> > already does a fallback to other onlined nodes in case a nodeid is passed 
> > to kmalloc_node() that does not have a nodelist.  l3->shared and l3->alien 
> > cannot be accessed without l3->list_lock (drain, cache_alloc_refill, 
> > cache_flusharray) or cache_chain_mutex (kmem_cache_destroy, cache_reap).
> 
> Yeah, but can't it _have_ a nodelist (ie. before it is set to NULL here)
> while it is being accessed by another CPU and concurrently being freed
> on this one? 
> 

You're right, we can't free cachep->nodelists[node] for any node that is 
being hot-removed to avoid a race in cache_alloc_node().  I thought we had 
protection for this under cache_chain_mutex for most dereferences and 
could disregard cache_alloc_refill() because numa_node_id() would never 
return a node being removed under memory hotplug, that would be the 
responsibility of cpu hotplug instead (offline the cpu first, then ensure 
numa_node_id() can't return a node under hot-remove).

Thanks for pointing that out, it's definitely broken here.

As an alternative, I think we should do something like this on 
MEM_GOING_OFFLINE:

	int ret = 0;

	mutex_lock(&cache_chain_mutex);
	list_for_each_entry(cachep, &cache_chain, next) {
		struct kmem_list3 *l3;

		l3 = cachep->nodelists[node];
		if (!l3)
			continue;
		drain_freelist(cachep, l3, l3->free_objects);

		ret = list_empty(&l3->slabs_full) &&
		      list_empty(&l3->slabs_partial);
		if (ret)
			break;
	}
	mutex_unlock(&cache_chain_mutex);
	return ret ? NOTIFY_BAD : NOTIFY_OK;

to preempt hot-remove of a node where there are slabs on the partial or 
free list that can't be freed.

Then, for MEM_OFFLINE, we leave cachep->nodelists[node] to be valid in 
case there are cache_alloc_node() racers or the node ever comes back 
online; susbequent callers to kmalloc_node() for the offlined node would 
actually return objects from fallback_alloc() since kmem_getpages() would 
fail for a node without present pages.

If slab is allocated after the drain_freelist() above, we'll never 
actually get MEM_OFFLINE since all pages can't be isolated for memory 
hot-remove, thus, the node will never be offlined.  kmem_getpages() can't 
allocate isolated pages, so this race must happen after drain_freelist() 
and prior to the pageblock being isolated.

So the MEM_GOING_OFFLINE check above is really more of a convenience to 
short-circuit the hot-remove if we know we can't free all slab on that 
node to avoid all the subsequent work that would happen only to run into 
isolation failure later.

We don't need to do anything for MEM_CANCEL_OFFLINE since the only affect 
of MEM_GOING_OFFLINE is to drain the freelist.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  parent reply	other threads:[~2010-03-28  2:13 UTC|newest]

Thread overview: 85+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-02-11 20:53 [PATCH] [0/4] Update slab memory hotplug series Andi Kleen
2010-02-11 20:54 ` [PATCH] [1/4] SLAB: Handle node-not-up case in fallback_alloc() v2 Andi Kleen
2010-02-11 21:41   ` David Rientjes
2010-02-11 21:55     ` Andi Kleen
2010-02-15  6:04   ` Nick Piggin
2010-02-15 10:07     ` Andi Kleen
2010-02-15 10:22       ` Nick Piggin
2010-02-11 20:54 ` [PATCH] [2/4] SLAB: Separate node initialization into separate function Andi Kleen
2010-02-11 21:44   ` David Rientjes
2010-02-11 20:54 ` [PATCH] [3/4] SLAB: Set up the l3 lists for the memory of freshly added memory v2 Andi Kleen
2010-02-11 21:45   ` David Rientjes
2010-02-15  6:06     ` Nick Piggin
2010-02-15 21:47       ` David Rientjes
2010-02-16 14:04         ` Nick Piggin
2010-02-16 20:45           ` Pekka Enberg
2010-02-11 20:54 ` [PATCH] [4/4] SLAB: Fix node add timer race in cache_reap Andi Kleen
2010-02-11 21:45   ` David Rientjes
2010-02-15  6:15   ` Nick Piggin
2010-02-15 10:32     ` Andi Kleen
2010-02-15 10:41       ` Nick Piggin
2010-02-15 10:52         ` Andi Kleen
2010-02-15 11:01           ` Nick Piggin
2010-02-15 15:30             ` Andi Kleen
2010-02-19 18:22             ` Christoph Lameter
2010-02-20  9:01               ` Andi Kleen
2010-02-22 10:53                 ` Pekka Enberg
2010-02-22 14:31                   ` Andi Kleen
2010-02-22 16:11                     ` Pekka Enberg
2010-02-22 20:20                       ` Andi Kleen
2010-02-24 15:49                 ` Christoph Lameter
2010-02-25  7:26                   ` Pekka Enberg
2010-02-25  8:01                     ` David Rientjes
2010-02-25 18:30                       ` Christoph Lameter
2010-02-25 21:45                         ` David Rientjes
2010-02-25 22:31                           ` Christoph Lameter
2010-02-26 10:45                             ` Pekka Enberg
2010-02-26 11:43                               ` Andi Kleen
2010-02-26 12:35                                 ` Pekka Enberg
2010-02-26 14:08                                   ` Andi Kleen
2010-02-26  1:09                         ` KAMEZAWA Hiroyuki
2010-02-26 11:41                         ` Andi Kleen
2010-02-26 15:04                           ` Christoph Lameter
2010-02-26 15:05                             ` Christoph Lameter
2010-02-26 15:59                               ` Andi Kleen
2010-02-26 15:57                             ` Andi Kleen
2010-02-26 17:24                               ` Christoph Lameter
2010-02-26 17:31                                 ` Andi Kleen
2010-03-01  1:59                                   ` KAMEZAWA Hiroyuki
2010-03-01 10:27                                     ` David Rientjes
2010-02-27  0:01                                 ` David Rientjes
2010-03-01 10:24                                   ` [patch] slab: add memory hotplug support David Rientjes
2010-03-02  5:53                                     ` Pekka Enberg
2010-03-02 20:20                                       ` Christoph Lameter
2010-03-02 21:03                                         ` David Rientjes
2010-03-03  1:28                                         ` KAMEZAWA Hiroyuki
2010-03-03  2:39                                           ` David Rientjes
2010-03-03  2:51                                             ` KAMEZAWA Hiroyuki
2010-03-02 12:53                                     ` Andi Kleen
2010-03-02 15:04                                       ` Pekka Enberg
2010-03-03 14:34                                         ` Andi Kleen
2010-03-03 15:46                                           ` Christoph Lameter
2010-03-02 21:17                                       ` David Rientjes
2010-03-05  6:20                                     ` Nick Piggin
2010-03-05 12:47                                       ` Anca Emanuel
2010-03-05 13:58                                         ` Anca Emanuel
2010-03-05 14:11                                         ` Christoph Lameter
2010-03-08  3:06                                           ` Andi Kleen
2010-03-08  2:58                                         ` Andi Kleen
2010-03-08 23:19                                       ` David Rientjes
2010-03-09 13:46                                         ` Nick Piggin
2010-03-22 17:28                                           ` Pekka Enberg
2010-03-22 21:12                                             ` Nick Piggin
2010-03-28  2:13                                           ` David Rientjes [this message]
2010-03-28  2:40                                             ` [patch v2] " David Rientjes
2010-03-30  9:01                                               ` Pekka Enberg
2010-03-30 16:43                                                 ` Christoph Lameter
2010-04-04 20:45                                                   ` David Rientjes
2010-04-07 16:29                                               ` Pekka Enberg
2010-02-25 18:34                     ` [PATCH] [4/4] SLAB: Fix node add timer race in cache_reap Christoph Lameter
2010-02-25 18:46                       ` Pekka Enberg
2010-02-25 19:19                         ` Christoph Lameter
2010-03-02 12:55                         ` Andi Kleen
2010-02-19 18:22       ` Christoph Lameter
2010-02-22 10:57         ` Pekka Enberg
2010-02-13 10:24 ` [PATCH] [0/4] Update slab memory hotplug series Pekka Enberg

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=alpine.DEB.2.00.1003271849260.7249@chino.kir.corp.google.com \
    --to=rientjes@google.com \
    --cc=andi@firstfloor.org \
    --cc=cl@linux-foundation.org \
    --cc=haicheng.li@intel.com \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=npiggin@suse.de \
    --cc=penberg@cs.helsinki.fi \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox