* [RFC] Slab: Enforce clean node lists per zone, add policy support and fallback
From: Christoph Lameter @ 2006-08-08 16:56 UTC
To: linux-mm; +Cc: Pekka Enberg, kiran, ak
There are certainly still issues for non-NUMA at this point, and we also need
to check how the slab behavior changes when memory gets low.
This patch ensures that the slab node lists only contain slabs that
belong to that specific node. All slab allocations use __GFP_THISNODE
when calling into the page allocator. If an allocation fails, then
we fall back within the slab allocator according to the zonelists
appropriate for the current context.
Currently the allocations may be redirected via cpusets to other nodes.
This results in remote pages on node lists, which in turn causes
interrupt latency issues during cache draining. In addition, the slab
allocator hands out memory as local when it is really remote.
Fallback for slab memory allocations therefore occurs within the slab
allocator and not in the page allocator. This is necessary in order
to use the existing pools of objects on the nodes that we fall back to
before adding more pages to a slab.
The fallback function ensures that the nodes we fall back to obey the
cpuset restrictions of the current context. We no longer allocate
slabs outside of the current cpuset context, as was possible before.
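For illustration only, a hedged sketch of the caller-visible effect
(alloc_for_node, my_cachep and nid are made-up names; the error handling
is not part of this patch):

	/* Hypothetical helper, not from the patch: allocate one object
	 * that should live on node nid. */
	static void *alloc_for_node(struct kmem_cache *my_cachep, int nid)
	{
		void *obj;

		/* Strict placement: with this patch the object must come
		 * from node nid; there is no fallback, so NULL really means
		 * that node is out of memory for this cache. */
		obj = kmem_cache_alloc_node(my_cachep,
					GFP_KERNEL | __GFP_THISNODE, nid);
		if (obj)
			return obj;

		/* Without __GFP_THISNODE: try node nid first, then fall back
		 * within the slab allocator along the policy zonelist while
		 * obeying cpuset constraints. */
		return kmem_cache_alloc_node(my_cachep, GFP_KERNEL, nid);
	}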
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.18-rc3-mm2/mm/slab.c
===================================================================
--- linux-2.6.18-rc3-mm2.orig/mm/slab.c 2006-08-08 09:45:56.472181039 -0700
+++ linux-2.6.18-rc3-mm2/mm/slab.c 2006-08-08 09:47:43.582735013 -0700
@@ -2226,7 +2226,7 @@ kmem_cache_create (const char *name, siz
 	cachep->colour = left_over / cachep->colour_off;
 	cachep->slab_size = slab_size;
 	cachep->flags = flags;
-	cachep->gfpflags = 0;
+	cachep->gfpflags = __GFP_THISNODE | __GFP_NORETRY | __GFP_NOWARN;
 	if (flags & SLAB_CACHE_DMA)
 		cachep->gfpflags |= GFP_DMA;
 	cachep->buffer_size = size;
@@ -3049,6 +3049,11 @@ static __always_inline void *__cache_all
 	local_irq_save(save_flags);
 	objp = ____cache_alloc(cachep, flags);
+#ifdef CONFIG_NUMA
+	/* __cache_alloc_node knows how to locate memory on other nodes */
+	if (!objp)
+		objp = __cache_alloc_node(cachep, flags, numa_node_id());
+#endif
 	local_irq_restore(save_flags);
 	objp = cache_alloc_debugcheck_after(cachep, flags, objp,
 					caller);
@@ -3067,7 +3072,7 @@ static void *alternate_node_alloc(struct
 {
 	int nid_alloc, nid_here;
 
-	if (in_interrupt())
+	if (in_interrupt() || (flags & __GFP_THISNODE))
 		return NULL;
 	nid_alloc = nid_here = numa_node_id();
 	if (cpuset_do_slab_mem_spread() && (cachep->flags & SLAB_MEM_SPREAD))
@@ -3083,6 +3088,27 @@ static void *alternate_node_alloc(struct
 }
 
 /*
+ * Fallback function if there was no memory available and no objects on a
+ * certain node and we are allowed to fall back. We mimic the behavior of
+ * the page allocator. We fall back according to a zonelist determined by
+ * the policy layer while obeying cpuset constraints.
+ */
+void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
+{
+	struct zonelist *zonelist = mpol_zonelist(flags, 0, NULL, 0);
+	struct zone **z;
+	void *obj = NULL;
+
+	for (z = zonelist->zones; *z && !obj; z++)
+		if (zone_idx(*z) == ZONE_NORMAL &&
+				cpuset_zone_allowed(*z, flags))
+			obj = __cache_alloc_node(cache,
+					flags | __GFP_THISNODE,
+					(*z)->zone_pgdat->node_id);
+	return obj;
+}
+
+/*
  * A interface to enable slab creation on nodeid
  */
 static void *__cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
@@ -3135,11 +3161,15 @@ retry:
 must_grow:
 	spin_unlock(&l3->list_lock);
 	x = cache_grow(cachep, flags, nodeid);
+	if (x)
+		goto retry;
 
-	if (!x)
-		return NULL;
+	if (!(flags & __GFP_THISNODE))
+		/* Unable to grow the cache. Fall back to other nodes. */
+		return fallback_alloc(cachep, flags);
+
+	return NULL;
 
-	goto retry;
 done:
	return obj;
 }
* Re: [RFC] Slab: Enforce clean node lists per zone, add policy support and fallback
From: Paul Jackson @ 2006-08-08 18:16 UTC
To: Christoph Lameter; +Cc: linux-mm, penberg, kiran, ak
Christoph wrote:
> Currently the allocations may be redirected via cpusets to other nodes.
Minor picky point of terminology ... I wouldn't say that cpusets
"redirect" the allocation, but "force" or "constrain" it. To my way
of speaking, a "redirect" would apply if the rule was "allocations
on node 6 should be satisfied on (redirected to) node 9", for
example. A forced constraint applies if the rule is "I don't care
what you asked for buddy - you're getting node 9, period."
Separate point - I think we already have a workaround in place for
the migration case to keep cpuset constraints out of the way. See
the overwriting of tsk->mems_allowed in the kernel/cpuset.c routine
cpuset_migrate_mm(). With Christoph's new __GFP_THISNODE, or whatever
alloc_pages_exact_node() with limited zonelist equivalent we come up
with, we don't need both that and the cpuset_migrate_mm() workaround.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [RFC] Slab: Enforce clean node lists per zone, add policy support and fallback
From: Paul Jackson @ 2006-08-08 18:29 UTC
To: Christoph Lameter; +Cc: linux-mm, penberg, kiran, ak
Rather than special-casing __cache_alloc_node() to handle the
fallback to other nodes when __GFP_THISNODE was -not- set, it might be
clearer to go with a custom, single-node zonelist (MPOL_BIND-like?)
approach, with no __GFP_THISNODE flag, for the few calls that do
require exact node placement.
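For concreteness, a very rough sketch of that direction (hypothetical code
against the 2.6.18-era page allocator; the helper name is the one floated
earlier in this thread, and neither it nor this zonelist construction exists
in the tree):

	/* Build a throwaway zonelist containing only the target node's
	 * zone, MPOL_BIND-style, and allocate from it directly.  A real
	 * implementation would not put a struct zonelist on the stack. */
	static struct page *alloc_pages_exact_node(int nid, gfp_t flags,
							unsigned int order)
	{
		struct zonelist one_node;

		one_node.zones[0] = &NODE_DATA(nid)->node_zones[gfp_zone(flags)];
		one_node.zones[1] = NULL;

		return __alloc_pages(flags, order, &one_node);
	}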
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [RFC] Slab: Enforce clean node lists per zone, add policy support and fallback
From: Christoph Lameter @ 2006-08-08 18:32 UTC
To: Paul Jackson; +Cc: linux-mm, penberg, kiran, ak
On Tue, 8 Aug 2006, Paul Jackson wrote:
> Separate point - I think we already have a workaround in place for
> the migration case to keep cpuset constraints out of the way. See
> the overwriting of tsk->mems_allowed in the kernel/cpuset.c routine
> cpuset_migrate_mm(). With Christoph's new __GFP_THISNODE, or whatever
> alloc_pages_exact_node() with limited zonelist equivalent we come up
> with, we don't need both that and the cpuset_migrate_mm() workaround.
You are confusing two issues in the migration code. The case of
sys_migrate_pages was fixed by you by changing the cpuset context. That's
fine, and we do not need __GFP_THISNODE there because the pages are to be
allocated in conformity with the cpuset context of the process.

In the case of sys_move_pages we move individual pages to particular
nodes. There we do not want cpuset constraints or mempolicy influences
to redirect the allocation.
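To make the distinction concrete, a sketch of the kind of allocation
sys_move_pages wants for the new page (illustrative only, not the actual
migration code; the function name is made up):

	/* The replacement page must come from exactly the requested node;
	 * no mempolicy or cpuset fallback is acceptable here. */
	static struct page *new_page_on_node(int nid)
	{
		return alloc_pages_node(nid, GFP_HIGHUSER | __GFP_THISNODE, 0);
	}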
* Re: [RFC] Slab: Enforce clean node lists per zone, add policy support and fallback
From: Paul Jackson @ 2006-08-08 19:20 UTC
To: Christoph Lameter; +Cc: linux-mm, penberg, kiran, ak
> You are confusing two issues in the migration code. The case of
> sys_migrate_pages was fixed by you by changing the cpuset context. That's
> fine, and we do not need __GFP_THISNODE there because the pages are to be
> allocated in conformity with the cpuset context of the process.
Ah - ok. Yes, I was looking for the constraints of the new, destination
cpuset, rather than the task's current cpuset. And I was not looking for
exact __GFP_THISNODE placement. So I was talking about a separate case.

Minor confusion with a confusion ... I don't know why you mentioned
'sys_migrate_pages' -- that wasn't what I was referring to. I was
referring to my cpuset_migrate_mm() hack, which is involved in two
cases:
1) a task is put in a cpuset that is marked 'memory_migrate', or
2) a task is in a cpuset marked 'memory_migrate' and that cpuset's
   'mems' are changed.

In any case ... never mind ;).
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401