* [RFC] Make the slab allocator observe NUMA policies
From: Christoph Lameter @ 2005-11-10 22:04 UTC (permalink / raw)
To: ak, steiner, linux-mm, alokk
Currently the slab allocator simply allocates slabs from the current node
or from the node indicated in kmalloc_node().
This change came about with the NUMA slab allocator changes in 2.6.14.
Before 2.6.14 the slab allocator obeyed memory policies in the sense
that pages were allocated in the policy context of the currently executing
process (which could allocate a page according to MPOL_INTERLEAVE for one
process and then use the free entries in that page for another process
that did not have this policy set).
The following patch adds NUMA memory policy support. This means that the
slab entries (and therefore also the pages containing them) will be allocated
according to memory policy.
This is of particular importance during bootup, when the default
memory policy is set to MPOL_INTERLEAVE. For 2.6.13 and earlier this meant
that the slab allocator got its pages from all nodes. 2.6.14 allocates
only from the boot node, causing an unbalanced memory setup when
bootup is complete.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.14-mm1/mm/slab.c
===================================================================
--- linux-2.6.14-mm1.orig/mm/slab.c 2005-11-10 13:00:11.000000000 -0800
+++ linux-2.6.14-mm1/mm/slab.c 2005-11-10 13:01:55.000000000 -0800
@@ -103,6 +103,7 @@
#include <linux/rcupdate.h>
#include <linux/string.h>
#include <linux/nodemask.h>
+#include <linux/mempolicy.h>
#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -2526,11 +2527,22 @@ cache_alloc_debugcheck_after(kmem_cache_
#define cache_alloc_debugcheck_after(a,b,objp,d) (objp)
#endif
+static void *__cache_alloc_node(kmem_cache_t *, gfp_t, int);
+
static inline void *____cache_alloc(kmem_cache_t *cachep, gfp_t flags)
{
void* objp;
struct array_cache *ac;
+#ifdef CONFIG_NUMA
+ if (current->mempolicy) {
+ int nid = next_slab_node(current->mempolicy);
+
+ if (nid != numa_node_id())
+ return __cache_alloc_node(cachep, flags, nid);
+ }
+#endif
+
check_irq_off();
ac = ac_data(cachep);
if (likely(ac->avail)) {
Index: linux-2.6.14-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.14-mm1.orig/mm/mempolicy.c 2005-11-09 10:47:15.000000000 -0800
+++ linux-2.6.14-mm1/mm/mempolicy.c 2005-11-10 13:01:55.000000000 -0800
@@ -988,6 +988,31 @@ static unsigned interleave_nodes(struct
return nid;
}
+/*
+ * Depending on the memory policy provide a node from which to allocate the
+ * next slab entry.
+ */
+unsigned next_slab_node(struct mempolicy *policy)
+{
+ switch (policy->policy) {
+ case MPOL_INTERLEAVE:
+ return interleave_nodes(policy);
+
+ case MPOL_BIND:
+ /*
+ * Follow bind policy behavior and start allocation at the
+ * first node.
+ */
+ return policy->v.zonelist->zones[0]->zone_pgdat->node_id;
+
+ case MPOL_PREFERRED:
+ return policy->v.preferred_node;
+
+ default:
+ return numa_node_id();
+ }
+}
+
/* Do static interleaving for a VMA with known offset. */
static unsigned offset_il_node(struct mempolicy *pol,
struct vm_area_struct *vma, unsigned long off)
Index: linux-2.6.14-mm1/include/linux/mempolicy.h
===================================================================
--- linux-2.6.14-mm1.orig/include/linux/mempolicy.h 2005-11-09 10:47:09.000000000 -0800
+++ linux-2.6.14-mm1/include/linux/mempolicy.h 2005-11-10 13:01:55.000000000 -0800
@@ -158,6 +158,7 @@ extern void numa_default_policy(void);
extern void numa_policy_init(void);
extern void numa_policy_rebind(const nodemask_t *old, const nodemask_t *new);
extern struct mempolicy default_policy;
+extern unsigned next_slab_node(struct mempolicy *policy);
int do_migrate_pages(struct mm_struct *mm,
const nodemask_t *from_nodes, const nodemask_t *to_nodes, int flags);
* Re: [RFC] Make the slab allocator observe NUMA policies
From: Andi Kleen @ 2005-11-11 3:06 UTC (permalink / raw)
To: Christoph Lameter; +Cc: steiner, linux-mm, alokk
On Thursday 10 November 2005 23:04, Christoph Lameter wrote:
> Currently the slab allocator simply allocates slabs from the current node
> or from the node indicated in kmalloc_node().
>
> This change came about with the NUMA slab allocator changes in 2.6.14.
> Before 2.6.14 the slab allocator was obeying memory policies in the sense
> that the pages were allocated in the policy context of the currently
> executing process (which could allocate a page according to MPOL_INTERLEAVE
> for one process and then use the free entries in that page for another
> process that did not have this policy set).
>
> The following patch adds NUMA memory policy support. This means that the
> slab entries (and therefore also the pages containing them) will be
> allocated according to memory policy.
You're adding a check and a potential cache line miss to a really, really hot
path. I would prefer to do the policy check only in the slower path of the
slab allocator that gets memory from the backing page allocator. While not
100% exact, this should be good enough for just spreading memory around
during initialization, and I cannot really think of any other uses for this.
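Roughly what I have in mind - just a sketch, untested, and kmem_getpages_policy()
is a name I made up here, but it could reuse your next_slab_node() helper:

static void *kmem_getpages_policy(kmem_cache_t *cachep, gfp_t flags)
{
	/* sketch: consult the policy only when the slab pulls fresh pages
	 * from the backing page allocator, not on every object allocation */
	int nid = numa_node_id();
	struct page *page;

#ifdef CONFIG_NUMA
	if (current->mempolicy)
		nid = next_slab_node(current->mempolicy);
#endif
	page = alloc_pages_node(nid, flags, cachep->gfporder);
	return page ? page_address(page) : NULL;
}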
-Andi
* Re: [RFC] Make the slab allocator observe NUMA policies
From: Christoph Lameter @ 2005-11-11 17:40 UTC (permalink / raw)
To: Andi Kleen; +Cc: steiner, linux-mm, alokk
On Fri, 11 Nov 2005, Andi Kleen wrote:
> > The following patch adds NUMA memory policy support. This means that the
> > slab entries (and therefore also the pages containing them) will be
> > allocated according to memory policy.
>
> You're adding a check and potential cache line miss to a really really hot
> path. I would prefer it to do the policy check only in the slower path of
> slab that gets memory from the backing page allocator. While not 100% exact
> this should be good enough for just spreading memory around during
> initialization. And I cannot really think of any other uses of this.
Hmm. That's not easy to do, since the slab allocator manages the pages
in terms of the nodes where they are located. The whole thing is geared to
first inspect the lists for one node and then expand if no page is
available.
The cacheline is already in use by the page allocator, which continually
references current->mempolicy. See alloc_page_vma and
alloc_pages_current. So it is likely that the cacheline is already active
and the impact on the hot code path is negligible.
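For reference, alloc_pages_current() already does roughly the following
(paraphrased from memory, not the literal 2.6.14-mm1 source):

struct page *alloc_pages_current(gfp_t gfp, unsigned order)
{
	struct mempolicy *pol = current->mempolicy;	/* the same cacheline the patch touches */

	if (!pol || in_interrupt())
		pol = &default_policy;
	if (pol->policy == MPOL_INTERLEAVE)
		return alloc_page_interleave(gfp, order, interleave_nodes(pol));
	return __alloc_pages(gfp, order, zonelist_policy(gfp, pol));
}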
* Re: [RFC] Make the slab allocator observe NUMA policies
From: Andi Kleen @ 2005-11-13 11:22 UTC (permalink / raw)
To: Christoph Lameter; +Cc: steiner, linux-mm, alokk
On Friday 11 November 2005 18:40, Christoph Lameter wrote:
> Hmm. That's not easy to do since the slab allocator is managing the pages
> in terms of the nodes where they are located. The whole thing is geared to
> first inspect the lists for one node and then expand if no page is
> available.
Yes, that's fine - as long as it doesn't allocate too many
pages at one go (which it doesn't) then the interleaving should
even the allocations out at page level.
> The cacheline is already in use by the page allocator, the page allocator
> will continually reference current->mempolicy. See alloc_page_vma and
> alloc_pages_current. So it's likely that the cacheline is already active
> and the impact on the hot code path is likely negligible.
I don't think that's likely - frequent users of kmem_cache_alloc don't
call alloc_pages. That is why we have slow and fast paths for this ...
But if we keep adding all the features of the slow paths to the fast paths,
then the fast paths will eventually not be fast anymore.
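To make that concrete (simplified from memory, not the exact source): the
common case never leaves the per-CPU array and today never touches
current->mempolicy at all - the patch adds that dereference to every single call:

	/* fast path of ____cache_alloc(), simplified: */
	void *objp;
	struct array_cache *ac = ac_data(cachep);

	if (likely(ac->avail)) {
		ac->touched = 1;
		objp = ac_entry(ac)[--ac->avail];	  /* no page allocator, no policy */
	} else {
		objp = cache_alloc_refill(cachep, flags); /* slow path, may grow the cache */
	}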
-Andi
* Re: [RFC] Make the slab allocator observe NUMA policies
From: Christoph Lameter @ 2005-11-14 18:05 UTC (permalink / raw)
To: Andi Kleen; +Cc: steiner, linux-mm, alokk
On Sun, 13 Nov 2005, Andi Kleen wrote:
> On Friday 11 November 2005 18:40, Christoph Lameter wrote:
>
> > Hmm. That's not easy to do since the slab allocator is managing the pages
> > in terms of the nodes where they are located. The whole thing is geared to
> > first inspect the lists for one node and then expand if no page is
> > available.
>
> Yes, that's fine - as long as it doesn't allocate too many
> pages at one go (which it doesn't) then the interleaving should
> even the allocations out at page level.
The slab allocator may allocate pages of higher order, which need to
be physically contiguous.
Any idea how to push this into the page allocation within the slab without
rearchitecting the thing?
> > The cacheline is already in use by the page allocator, the page allocator
> > will continually reference current->mempolicy. See alloc_page_vma and
> > alloc_pages_current. So it's likely that the cacheline is already active
> > and the impact on the hot code path is likely negligible.
>
> I don't think that's likely - frequent users of kmem_cache_alloc don't
> call alloc_pages. That is why we have slow and fast paths for this ...
> But if we keep adding all the features of slow paths to fast paths
> then the fast paths will be eventually not be fast anymore.
IMHO, an application allocating memory is highly likely to call other
memory allocation functions at the same time. Small cache operations are
typically related to page-sized allocations.
* Re: [RFC] Make the slab allocator observe NUMA policies
From: Andi Kleen @ 2005-11-14 18:44 UTC (permalink / raw)
To: Christoph Lameter; +Cc: steiner, linux-mm, alokk
On Monday 14 November 2005 19:05, Christoph Lameter wrote:
> On Sun, 13 Nov 2005, Andi Kleen wrote:
>
> > On Friday 11 November 2005 18:40, Christoph Lameter wrote:
> >
> > > Hmm. That's not easy to do since the slab allocator is managing the pages
> > > in terms of the nodes where they are located. The whole thing is geared to
> > > first inspect the lists for one node and then expand if no page is
> > > available.
> >
> > Yes, that's fine - as long as it doesn't allocate too many
> > pages at one go (which it doesn't) then the interleaving should
> > even the allocations out at page level.
>
> The slab allocator may allocate pages higher orders which need to
> be physically continuous.
> Any idea how to push this to the page allocation within the slab without
> rearchitecting the thing?
I believe that's only a small fraction of the allocations, for the cases where
the slabs are big enough to be a significant part of the page.
Proof: VM breaks down with higher orders. If slab would use them
all the time it would break down too. It doesn't. Q.E.D ;-)
Also, looking at the objsize column in /proc/slabinfo, most slabs are
significantly smaller than a page and the higher kmalloc slabs don't have
too many objects, so slab shouldn't do this too often.
You're right that they're a problem, but perhaps they can just be ignored
(if they are <20% of the allocations, the imbalance resulting
from them might not be too bad).
Another way (as a backup option) would be to RR them as higher order pages,
but that would need new special code.
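Something along those lines perhaps (purely hypothetical - rr_node would be a
new per-cache field, initialized to an online node):

static int next_rr_node(kmem_cache_t *cachep)
{
	/* hypothetical: round-robin the (rare) higher order slab pages over
	 * the online nodes with a per-cache cursor, independent of the
	 * per-object fast path */
	int nid = cachep->rr_node;

	cachep->rr_node = next_node(nid, node_online_map);
	if (cachep->rr_node == MAX_NUMNODES)
		cachep->rr_node = first_node(node_online_map);
	return nid;
}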
-Andi
* Re: [RFC] Make the slab allocator observe NUMA policies
From: Christoph Lameter @ 2005-11-14 19:08 UTC (permalink / raw)
To: Andi Kleen; +Cc: steiner, linux-mm, alokk
On Mon, 14 Nov 2005, Andi Kleen wrote:
> > Any idea how to push this to the page allocation within the slab without
> > rearchitecting the thing?
>
> I believe that's only a small fraction of the allocations, for where
> the slabs are big enough to be a significant part of the page.
>
> Proof: VM breaks down with higher orders. If slab would use them
> all the time it would break down too. It doesn't. Q.E.D ;-)
Yes, the higher order pages are rare. However, regular sized pages are
frequent, and the allocations for these pages always consult
task->mempolicy.
> Another way (as a backup option) would be to RR them as higher order pages,
> but that would need new special code.
The proposed patch round-robins higher order pages as configured by the memory
policy.
The other fundamental problem that I mentioned remains:
The slab allocator is designed in such a way that it needs to know the
node for the allocation before it does its work. This is because the
nodelists are per node since 2.6.14. You wanted to do the policy
application on the back end, i.e. after all the work is done (presumably
for the current node) and after the node specific lists have been
examined. Policy application at that point may find that a node other
than the current node was desired, and the whole thing has to be
redone for the other node. This will significantly hurt
the performance of the slab allocator, in particular if the current node
is unlikely to be chosen by the memory policy.
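In pseudo-code the two variants compare roughly like this (flow simplified;
the function names are from 2.6.14-mm1 slab.c plus the proposed next_slab_node()):

	/*
	 * front end (this patch):
	 *	nid = next_slab_node(current->mempolicy);
	 *	__cache_alloc_node(cachep, flags, nid);		- walk nid's lists once
	 *
	 * back end (policy applied around cache_grow()/kmem_getpages()):
	 *	cache_alloc_refill(cachep, flags);		- lock and walk the local lists
	 *	cache_grow(cachep, flags, policy_node);		- policy may now pick another node
	 *	__cache_alloc_node(cachep, flags, policy_node);	- and the list work is redone
	 */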
I have thought about various ways to modify kmem_getpages() but these do
not fit into the basic current concept of the slab allocator. The
proposed method is the cleanest approach that I can think of. I'd be glad
if you could come up with something different but AFAIK simply moving the
policy application down in the slab allocator does not work.
* Re: [RFC] Make the slab allocator observe NUMA policies
From: Andi Kleen @ 2005-11-15 3:34 UTC (permalink / raw)
To: Christoph Lameter; +Cc: steiner, linux-mm, alokk
On Monday 14 November 2005 20:08, Christoph Lameter wrote:
> The slab allocator is designed in such a way that it needs to know the
> node for the allocation before it does its work. This is because the
> nodelists are per node since 2.6.14. You wanted to do the policy
> application on the back end so after all the work is done (presumably
> for the current node) and after the node specific lists have been
> examined. Policy application at that point may find that another
> node than the current node was desired and the whole thing has to be
> redone for the other node. This will significantly negatively impact
> the performance of the slab allocator in particular if the current node
> is is unlikely to be chosen for the memory policy.
>
> I have thought about various ways to modify kmem_getpages() but these do
> not fit into the basic current concept of the slab allocator. The
> proposed method is the cleanest approach that I can think of. I'd be glad
> if you could come up with something different but AFAIK simply moving the
> policy application down in the slab allocator does not work.
I haven't checked all the details, but why can't it be done at the cache_grow
layer? (That's already a slow path.)
If it's not possible to do it in the slow path, I would say the design is
incompatible with interleaving. Better not to do it at all than to do it wrong.
-Andi
* Re: [RFC] Make the slab allocator observe NUMA policies
From: Christoph Lameter @ 2005-11-15 16:43 UTC (permalink / raw)
To: Andi Kleen; +Cc: steiner, linux-mm, alokk
On Tue, 15 Nov 2005, Andi Kleen wrote:
> On Monday 14 November 2005 20:08, Christoph Lameter wrote:
> > I have thought about various ways to modify kmem_getpages() but these do
> > not fit into the basic current concept of the slab allocator. The
> > proposed method is the cleanest approach that I can think of. I'd be glad
> > if you could come up with something different but AFAIK simply moving the
> > policy application down in the slab allocator does not work.
>
> I haven't checked all the details, but why can't it be done at the cache_grow
> layer? (that's already a slow path)
cache_grow is called only after the lists have been checked. It's the same
scenario I described.
> If it's not possible to do it in the slow path I would say the design is
> incompatible with interleaving then. Better not do it then than doing it wrong.
If MPOL_INTERLEAVE is set, then multiple kmalloc() invocations will
allocate each item round-robin across the nodes. That is the intended function
of MPOL_INTERLEAVE, right?
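I.e., with the patch applied (illustrative only, assuming a task interleaving
over nodes 0 and 1):

	void *a, *b, *c;

	a = kmem_cache_alloc(cachep, GFP_KERNEL);	/* object comes from node 0 */
	b = kmem_cache_alloc(cachep, GFP_KERNEL);	/* object comes from node 1 */
	c = kmem_cache_alloc(cachep, GFP_KERNEL);	/* node 0 again, and so on */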
* Re: [RFC] Make the slab allocator observe NUMA policies
From: Andi Kleen @ 2005-11-15 16:51 UTC (permalink / raw)
To: Christoph Lameter; +Cc: steiner, linux-mm, alokk
On Tuesday 15 November 2005 17:43, Christoph Lameter wrote:
> On Tue, 15 Nov 2005, Andi Kleen wrote:
> > On Monday 14 November 2005 20:08, Christoph Lameter wrote:
> > > I have thought about various ways to modify kmem_getpages() but these
> > > do not fit into the basic current concept of the slab allocator. The
> > > proposed method is the cleanest approach that I can think of. I'd be
> > > glad if you could come up with something different but AFAIK simply
> > > moving the policy application down in the slab allocator does not work.
> >
> > I haven't checked all the details, but why can't it be done at the
> > cache_grow layer? (that's already a slow path)
>
> cache_grow is called only after the lists have been checked. Its the same
> scenario as I described.
So retry the check?
>
> > If it's not possible to do it in the slow path I would say the design is
> > incompatible with interleaving then. Better not do it then than doing it
> > wrong.
>
> If MPOL_INTERLEAVE is set then multiple kmalloc() invocations will
> allocate each item round robin on each node. That is the intended function
> of MPOL_INTERLEAVE right?
Memory policy was always designed to work only on pages, not on smaller
objects. So no.
-Andi
* Re: [RFC] Make the slab allocator observe NUMA policies
From: Christoph Lameter @ 2005-11-15 16:55 UTC (permalink / raw)
To: Andi Kleen; +Cc: steiner, linux-mm, alokk, akpm
On Tue, 15 Nov 2005, Andi Kleen wrote:
> > > I haven't checked all the details, but why can't it be done at the
> > > cache_grow layer? (that's already a slow path)
> >
> > cache_grow is called only after the lists have been checked. It's the same
> > scenario as I described.
>
> So retry the check?
The checks there are quite extensive; there is locking going on, etc. There is
no easy way back, and this is easily going to offset what you see as negative
in the proposed patch.
> > > If it's not possible to do it in the slow path I would say the design is
> > > incompatible with interleaving then. Better not do it then than doing it
> > > wrong.
> >
> > If MPOL_INTERLEAVE is set then multiple kmalloc() invocations will
> > allocate each item round robin on each node. That is the intended function
> > of MPOL_INTERLEAVE right?
>
> memory policy was always only designed to work on pages, not on smaller
> objects. So no.
Memory policy works on huge pages in SLES9, so it already works on larger
objects. Why should it not also work on smaller objects?