* [RFC] Make the slab allocator observe NUMA policies
From: Christoph Lameter @ 2005-11-10 22:04 UTC (permalink / raw)
To: ak, steiner, linux-mm, alokk
Currently the slab allocator simply allocates slabs from the current node
or from the node indicated in kmalloc_node().
This change came about with the NUMA slab allocator changes in 2.6.14.
Before 2.6.14 the slab allocator obeyed memory policies in the sense
that pages were allocated in the policy context of the currently executing
process (which could allocate a page according to MPOL_INTERLEAVE for one
process and then use the free entries in that page for another process
that did not have this policy set).
The following patch adds NUMA memory policy support. This means that the
slab entries (and therefore also the pages containing them) will be allocated
according to memory policy.
This is of particular importance during bootup, when the default
memory policy is set to MPOL_INTERLEAVE. For 2.6.13 and earlier this meant
that the slab allocator got its pages from all nodes. 2.6.14 allocates
only from the boot node, causing an unbalanced memory setup when
bootup is complete.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.14-mm1/mm/slab.c
===================================================================
--- linux-2.6.14-mm1.orig/mm/slab.c 2005-11-10 13:00:11.000000000 -0800
+++ linux-2.6.14-mm1/mm/slab.c 2005-11-10 13:01:55.000000000 -0800
@@ -103,6 +103,7 @@
#include <linux/rcupdate.h>
#include <linux/string.h>
#include <linux/nodemask.h>
+#include <linux/mempolicy.h>
#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -2526,11 +2527,22 @@ cache_alloc_debugcheck_after(kmem_cache_
#define cache_alloc_debugcheck_after(a,b,objp,d) (objp)
#endif
+static void *__cache_alloc_node(kmem_cache_t *, gfp_t, int);
+
static inline void *____cache_alloc(kmem_cache_t *cachep, gfp_t flags)
{
void* objp;
struct array_cache *ac;
+#ifdef CONFIG_NUMA
+ if (current->mempolicy) {
+ int nid = next_slab_node(current->mempolicy);
+
+ if (nid != numa_node_id())
+ return __cache_alloc_node(cachep, flags, nid);
+ }
+#endif
+
check_irq_off();
ac = ac_data(cachep);
if (likely(ac->avail)) {
Index: linux-2.6.14-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.14-mm1.orig/mm/mempolicy.c 2005-11-09 10:47:15.000000000 -0800
+++ linux-2.6.14-mm1/mm/mempolicy.c 2005-11-10 13:01:55.000000000 -0800
@@ -988,6 +988,31 @@ static unsigned interleave_nodes(struct
return nid;
}
+/*
+ * Depending on the memory policy provide a node from which to allocate the
+ * next slab entry.
+ */
+unsigned next_slab_node(struct mempolicy *policy)
+{
+ switch (policy->policy) {
+ case MPOL_INTERLEAVE:
+ return interleave_nodes(policy);
+
+ case MPOL_BIND:
+ /*
+ * Follow bind policy behavior and start allocation at the
+ * first node.
+ */
+ return policy->v.zonelist->zones[0]->zone_pgdat->node_id;
+
+ case MPOL_PREFERRED:
+ return policy->v.preferred_node;
+
+ default:
+ return numa_node_id();
+ }
+}
+
/* Do static interleaving for a VMA with known offset. */
static unsigned offset_il_node(struct mempolicy *pol,
struct vm_area_struct *vma, unsigned long off)
Index: linux-2.6.14-mm1/include/linux/mempolicy.h
===================================================================
--- linux-2.6.14-mm1.orig/include/linux/mempolicy.h 2005-11-09 10:47:09.000000000 -0800
+++ linux-2.6.14-mm1/include/linux/mempolicy.h 2005-11-10 13:01:55.000000000 -0800
@@ -158,6 +158,7 @@ extern void numa_default_policy(void);
extern void numa_policy_init(void);
extern void numa_policy_rebind(const nodemask_t *old, const nodemask_t *new);
extern struct mempolicy default_policy;
+extern unsigned next_slab_node(struct mempolicy *policy);
int do_migrate_pages(struct mm_struct *mm,
const nodemask_t *from_nodes, const nodemask_t *to_nodes, int flags);
* Re: [RFC] Make the slab allocator observe NUMA policies
From: Andi Kleen @ 2005-11-11 3:06 UTC (permalink / raw)
To: Christoph Lameter; +Cc: steiner, linux-mm, alokk
On Thursday 10 November 2005 23:04, Christoph Lameter wrote:
> Currently the slab allocator simply allocates slabs from the current node
> or from the node indicated in kmalloc_node().
>
> This change came about with the NUMA slab allocator changes in 2.6.14.
> Before 2.6.14 the slab allocator was obeying memory policies in the sense
> that the pages were allocated in the policy context of the currently
> executing process (which could allocate a page according to MPOL_INTERLEAVE
> for one process and then use the free entries in that page for another
> process that did not have this policy set).
>
> The following patch adds NUMA memory policy support. This means that the
> slab entries (and therefore also the pages containing them) will be
> allocated according to memory policy.
You're adding a check and a potential cache line miss to a really, really hot
path. I would prefer to do the policy check only in the slower path of the
slab allocator that gets memory from the backing page allocator. While not
100% exact, this should be good enough for just spreading memory around
during initialization, and I cannot really think of any other uses for this.
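Roughly what I have in mind - just a sketch, untested, and kmem_getpages_policy()
is a name I made up here, but it could reuse your next_slab_node() helper:

static void *kmem_getpages_policy(kmem_cache_t *cachep, gfp_t flags)
{
	/* sketch: consult the policy only when the slab pulls fresh pages
	 * from the backing page allocator, not on every object allocation */
	int nid = numa_node_id();
	struct page *page;

#ifdef CONFIG_NUMA
	if (current->mempolicy)
		nid = next_slab_node(current->mempolicy);
#endif
	page = alloc_pages_node(nid, flags, cachep->gfporder);
	return page ? page_address(page) : NULL;
}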
-Andi
* Re: [RFC] Make the slab allocator observe NUMA policies
From: Christoph Lameter @ 2005-11-11 17:40 UTC (permalink / raw)
To: Andi Kleen; +Cc: steiner, linux-mm, alokk
On Fri, 11 Nov 2005, Andi Kleen wrote:
> > The following patch adds NUMA memory policy support. This means that the
> > slab entries (and therefore also the pages containing them) will be
> > allocated according to memory policy.
>
> You're adding a check and potential cache line miss to a really really hot
> path. I would prefer it to do the policy check only in the slower path of
> slab that gets memory from the backing page allocator. While not 100% exact
> this should be good enough for just spreading memory around during
> initialization. And I cannot really think of any other uses of this.
Hmm. That's not easy to do, since the slab allocator manages the pages
in terms of the nodes where they are located. The whole thing is geared to
first inspect the lists for one node and then expand if no page is
available.
The cacheline is already in use by the page allocator, which continually
references current->mempolicy. See alloc_page_vma and
alloc_pages_current. So it is likely that the cacheline is already active
and the impact on the hot code path is negligible.
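For reference, alloc_pages_current() already does roughly the following
(paraphrased from memory, not the literal 2.6.14-mm1 source):

struct page *alloc_pages_current(gfp_t gfp, unsigned order)
{
	struct mempolicy *pol = current->mempolicy;	/* the same cacheline the patch touches */

	if (!pol || in_interrupt())
		pol = &default_policy;
	if (pol->policy == MPOL_INTERLEAVE)
		return alloc_page_interleave(gfp, order, interleave_nodes(pol));
	return __alloc_pages(gfp, order, zonelist_policy(gfp, pol));
}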
* Re: [RFC] Make the slab allocator observe NUMA policies
From: Andi Kleen @ 2005-11-13 11:22 UTC (permalink / raw)
To: Christoph Lameter; +Cc: steiner, linux-mm, alokk
On Friday 11 November 2005 18:40, Christoph Lameter wrote:
> Hmm. That's not easy to do since the slab allocator is managing the pages
> in terms of the nodes where they are located. The whole thing is geared to
> first inspect the lists for one node and then expand if no page is
> available.
Yes, that's fine - as long as it doesn't allocate too many
pages at one go (which it doesn't) then the interleaving should
even the allocations out at page level.
> The cacheline is already in use by the page allocator, the page allocator
> will continually reference current->mempolicy. See alloc_page_vma and
> alloc_pages_current. So it's likely that the cacheline is already active
> and the impact on the hot code path is likely negligible.
I don't think that's likely - frequent users of kmem_cache_alloc don't
call alloc_pages. That is why we have slow and fast paths for this ...
But if we keep adding all the features of the slow paths to the fast paths,
then the fast paths will eventually not be fast anymore.
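To make that concrete (simplified from memory, not the exact source): the
common case never leaves the per-CPU array and today never touches
current->mempolicy at all - the patch adds that dereference to every single call:

	/* fast path of ____cache_alloc(), simplified: */
	void *objp;
	struct array_cache *ac = ac_data(cachep);

	if (likely(ac->avail)) {
		ac->touched = 1;
		objp = ac_entry(ac)[--ac->avail];	  /* no page allocator, no policy */
	} else {
		objp = cache_alloc_refill(cachep, flags); /* slow path, may grow the cache */
	}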
-Andi
* Re: [RFC] Make the slab allocator observe NUMA policies
From: Christoph Lameter @ 2005-11-14 18:05 UTC (permalink / raw)
To: Andi Kleen; +Cc: steiner, linux-mm, alokk
On Sun, 13 Nov 2005, Andi Kleen wrote:
> On Friday 11 November 2005 18:40, Christoph Lameter wrote:
>
> > Hmm. That's not easy to do since the slab allocator is managing the pages
> > in terms of the nodes where they are located. The whole thing is geared to
> > first inspect the lists for one node and then expand if no page is
> > available.
>
> Yes, that's fine - as long as it doesn't allocate too many
> pages at one go (which it doesn't) then the interleaving should
> even the allocations out at page level.
The slab allocator may allocate pages of higher order, which need to
be physically contiguous.
Any idea how to push this into the page allocation within the slab without
rearchitecting the thing?
> > The cacheline is already in use by the page allocator, the page allocator
> > will continually reference current->mempolicy. See alloc_page_vma and
> > alloc_pages_current. So it's likely that the cacheline is already active
> > and the impact on the hot code path is likely negligible.
>
> I don't think that's likely - frequent users of kmem_cache_alloc don't
> call alloc_pages. That is why we have slow and fast paths for this ...
> But if we keep adding all the features of slow paths to fast paths
> then the fast paths will be eventually not be fast anymore.
IMHO, an application allocating memory is highly likely to call other
memory allocation functions at the same time. Small cache operations are
typically related to page-sized allocations.
* Re: [RFC] Make the slab allocator observe NUMA policies
From: Andi Kleen @ 2005-11-14 18:44 UTC (permalink / raw)
To: Christoph Lameter; +Cc: steiner, linux-mm, alokk
On Monday 14 November 2005 19:05, Christoph Lameter wrote:
> On Sun, 13 Nov 2005, Andi Kleen wrote:
>
> > On Friday 11 November 2005 18:40, Christoph Lameter wrote:
> >
> > > Hmm. That's not easy to do since the slab allocator is managing the pages
> > > in terms of the nodes where they are located. The whole thing is geared to
> > > first inspect the lists for one node and then expand if no page is
> > > available.
> >
> > Yes, that's fine - as long as it doesn't allocate too many
> > pages at one go (which it doesn't) then the interleaving should
> > even the allocations out at page level.
>
> The slab allocator may allocate pages higher orders which need to
> be physically continuous.
> Any idea how to push this to the page allocation within the slab without
> rearchitecting the thing?
I believe that's only a small fraction of the allocations, for the cases where
the slabs are big enough to be a significant part of the page.
Proof: VM breaks down with higher orders. If slab would use them
all the time it would break down too. It doesn't. Q.E.D ;-)
Also, looking at the objsize column in /proc/slabinfo, most slabs are
significantly smaller than a page and the higher kmalloc slabs don't have
too many objects, so slab shouldn't do this too often.
You're right that they're a problem, but perhaps they can just be ignored
(if they are <20% of the allocations, the imbalance resulting
from them might not be too bad).
Another way (as a backup option) would be to RR them as higher order pages,
but that would need new special code.
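Something along those lines perhaps (purely hypothetical - rr_node would be a
new per-cache field, initialized to an online node):

static int next_rr_node(kmem_cache_t *cachep)
{
	/* hypothetical: round-robin the (rare) higher order slab pages over
	 * the online nodes with a per-cache cursor, independent of the
	 * per-object fast path */
	int nid = cachep->rr_node;

	cachep->rr_node = next_node(nid, node_online_map);
	if (cachep->rr_node == MAX_NUMNODES)
		cachep->rr_node = first_node(node_online_map);
	return nid;
}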
-Andi
* Re: [RFC] Make the slab allocator observe NUMA policies
From: Christoph Lameter @ 2005-11-14 19:08 UTC (permalink / raw)
To: Andi Kleen; +Cc: steiner, linux-mm, alokk
On Mon, 14 Nov 2005, Andi Kleen wrote:
> > Any idea how to push this to the page allocation within the slab without
> > rearchitecting the thing?
>
> I believe that's only a small fraction of the allocations, for where
> the slabs are big enough to be a significant part of the page.
>
> Proof: VM breaks down with higher orders. If slab would use them
> all the time it would break down too. It doesn't. Q.E.D ;-)
Yes, the higher order pages are rare. However, regular sized pages are
frequent, and the allocations for these pages always consult
task->mempolicy.
> Another way (as a backup option) would be to RR them as higher order pages,
> but that would need new special code.
The proposed patch round-robins higher order pages as configured by the memory
policy.
The other fundamental problem that I mentioned remains:
The slab allocator is designed in such a way that it needs to know the
node for the allocation before it does its work. This is because the
nodelists are per node since 2.6.14. You wanted to do the policy
application on the back end, i.e. after all the work is done (presumably
for the current node) and after the node specific lists have been
examined. Policy application at that point may find that a node other
than the current node was desired, and the whole thing has to be
redone for the other node. This will significantly hurt
the performance of the slab allocator, in particular if the current node
is unlikely to be chosen by the memory policy.
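In pseudo-code the two variants compare roughly like this (flow simplified;
the function names are from 2.6.14-mm1 slab.c plus the proposed next_slab_node()):

	/*
	 * front end (this patch):
	 *	nid = next_slab_node(current->mempolicy);
	 *	__cache_alloc_node(cachep, flags, nid);		- walk nid's lists once
	 *
	 * back end (policy applied around cache_grow()/kmem_getpages()):
	 *	cache_alloc_refill(cachep, flags);		- lock and walk the local lists
	 *	cache_grow(cachep, flags, policy_node);		- policy may now pick another node
	 *	__cache_alloc_node(cachep, flags, policy_node);	- and the list work is redone
	 */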
I have thought about various ways to modify kmem_getpages() but these do
not fit into the basic current concept of the slab allocator. The
proposed method is the cleanest approach that I can think of. I'd be glad
if you could come up with something different but AFAIK simply moving the
policy application down in the slab allocator does not work.
* Re: [RFC] Make the slab allocator observe NUMA policies
From: Andi Kleen @ 2005-11-15 3:34 UTC (permalink / raw)
To: Christoph Lameter; +Cc: steiner, linux-mm, alokk
On Monday 14 November 2005 20:08, Christoph Lameter wrote:
> The slab allocator is designed in such a way that it needs to know the
> node for the allocation before it does its work. This is because the
> nodelists are per node since 2.6.14. You wanted to do the policy
> application on the back end so after all the work is done (presumably
> for the current node) and after the node specific lists have been
> examined. Policy application at that point may find that another
> node than the current node was desired and the whole thing has to be
> redone for the other node. This will significantly negatively impact
> the performance of the slab allocator in particular if the current node
> is is unlikely to be chosen for the memory policy.
>
> I have thought about various ways to modify kmem_getpages() but these do
> not fit into the basic current concept of the slab allocator. The
> proposed method is the cleanest approach that I can think of. I'd be glad
> if you could come up with something different but AFAIK simply moving the
> policy application down in the slab allocator does not work.
I haven't checked all the details, but why can't it be done at the cache_grow
layer? (That's already a slow path.)
If it's not possible to do it in the slow path, I would say the design is
incompatible with interleaving. Better not to do it at all than to do it wrong.
-Andi
* Re: [RFC] Make the slab allocator observe NUMA policies
From: Christoph Lameter @ 2005-11-15 16:43 UTC (permalink / raw)
To: Andi Kleen; +Cc: steiner, linux-mm, alokk
On Tue, 15 Nov 2005, Andi Kleen wrote:
> On Monday 14 November 2005 20:08, Christoph Lameter wrote:
> > I have thought about various ways to modify kmem_getpages() but these do
> > not fit into the basic current concept of the slab allocator. The
> > proposed method is the cleanest approach that I can think of. I'd be glad
> > if you could come up with something different but AFAIK simply moving the
> > policy application down in the slab allocator does not work.
>
> I haven't checked all the details, but why can't it be done at the cache_grow
> layer? (that's already a slow path)
cache_grow is called only after the lists have been checked. It's the same
scenario I described.
> If it's not possible to do it in the slow path I would say the design is
> incompatible with interleaving then. Better not do it then than doing it wrong.
If MPOL_INTERLEAVE is set, then multiple kmalloc() invocations will
allocate each item round-robin across the nodes. That is the intended function
of MPOL_INTERLEAVE, right?
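I.e., with the patch applied (illustrative only, assuming a task interleaving
over nodes 0 and 1):

	void *a, *b, *c;

	a = kmem_cache_alloc(cachep, GFP_KERNEL);	/* object comes from node 0 */
	b = kmem_cache_alloc(cachep, GFP_KERNEL);	/* object comes from node 1 */
	c = kmem_cache_alloc(cachep, GFP_KERNEL);	/* node 0 again, and so on */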
* Re: [RFC] Make the slab allocator observe NUMA policies
From: Andi Kleen @ 2005-11-15 16:51 UTC (permalink / raw)
To: Christoph Lameter; +Cc: steiner, linux-mm, alokk
On Tuesday 15 November 2005 17:43, Christoph Lameter wrote:
> On Tue, 15 Nov 2005, Andi Kleen wrote:
> > On Monday 14 November 2005 20:08, Christoph Lameter wrote:
> > > I have thought about various ways to modify kmem_getpages() but these
> > > do not fit into the basic current concept of the slab allocator. The
> > > proposed method is the cleanest approach that I can think of. I'd be
> > > glad if you could come up with something different but AFAIK simply
> > > moving the policy application down in the slab allocator does not work.
> >
> > I haven't checked all the details, but why can't it be done at the
> > cache_grow layer? (that's already a slow path)
>
> cache_grow is called only after the lists have been checked. Its the same
> scenario as I described.
So retry the check?
>
> > If it's not possible to do it in the slow path I would say the design is
> > incompatible with interleaving then. Better not do it then than doing it
> > wrong.
>
> If MPOL_INTERLEAVE is set then multiple kmalloc() invocations will
> allocate each item round robin on each node. That is the intended function
> of MPOL_INTERLEAVE right?
Memory policy was always designed to work only on pages, not on smaller
objects. So no.
-Andi
* Re: [RFC] Make the slab allocator observe NUMA policies
From: Christoph Lameter @ 2005-11-15 16:55 UTC (permalink / raw)
To: Andi Kleen; +Cc: steiner, linux-mm, alokk, akpm
On Tue, 15 Nov 2005, Andi Kleen wrote:
> > > I haven't checked all the details, but why can't it be done at the
> > > cache_grow layer? (that's already a slow path)
> >
> > cache_grow is called only after the lists have been checked. It's the same
> > scenario as I described.
>
> So retry the check?
The checks there are quite extensive; there is locking going on, etc. There is
no easy way back, and this is easily going to offset what you see as negative
in the proposed patch.
> > > If it's not possible to do it in the slow path I would say the design is
> > > incompatible with interleaving then. Better not do it then than doing it
> > > wrong.
> >
> > If MPOL_INTERLEAVE is set then multiple kmalloc() invocations will
> > allocate each item round robin on each node. That is the intended function
> > of MPOL_INTERLEAVE right?
>
> memory policy was always only designed to work on pages, not on smaller
> objects. So no.
Memory policy works on huge pages in SLES9, so it already works on larger
objects. Why should it not also work on smaller objects?