* [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Paul Mundt <lethal@linux-sh.org> @ 2007-06-07  1:17 UTC
To: Andrew Morton; +Cc: linux-mm, ak, clameter, hugh, lee.schermerhorn

This is an alternative approach to the MPOL_INTERLEAVE across online
nodes as the system init policy. Andi suggested it might be worthwhile
trying to do this dynamically rather than as a command line option, so
that's what this tries to do.

With this, the online nodes are sized and packed into an interleave map
if they're large enough for interleave to be worthwhile. I arbitrarily
chose 16MB as the node size to enable interleaving, but perhaps someone
has a better figure in mind?

In the case where all of the nodes are smaller than that, the largest
node is selected and placed into the map by itself (if they're all the
same size, the first online node gets used).

If people prefer this approach, the previous patch adding mpolinit can be
dropped.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>

--

 mm/mempolicy.c |   31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d76e8eb..a67c8f1 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1597,6 +1597,10 @@ void mpol_free_shared_policy(struct shared_policy *p)
 /* assumes fs == KERNEL_DS */
 void __init numa_policy_init(void)
 {
+        nodemask_t interleave_nodes;
+        unsigned long largest = 0;
+        int nid, prefer = 0;
+
         policy_cache = kmem_cache_create("numa_policy",
                                          sizeof(struct mempolicy),
                                          0, SLAB_PANIC, NULL, NULL);
@@ -1605,10 +1609,31 @@ void __init numa_policy_init(void)
                                      sizeof(struct sp_node),
                                      0, SLAB_PANIC, NULL, NULL);
 
-        /* Set interleaving policy for system init. This way not all
-           the data structures allocated at system boot end up in node zero. */
+        /*
+         * Set interleaving policy for system init. Interleaving is only
+         * enabled across suitably sized nodes (default is >= 16MB), or
+         * fall back to the largest node if they're all smaller.
+         */
+        nodes_clear(interleave_nodes);
+        for_each_online_node(nid) {
+                unsigned long total_pages = node_present_pages(nid);
+
+                /* Preserve the largest node */
+                if (largest < total_pages) {
+                        largest = total_pages;
+                        prefer = nid;
+                }
+
+                /* Interleave this node? */
+                if ((total_pages << PAGE_SHIFT) >= (16 << 20))
+                        node_set(nid, interleave_nodes);
+        }
+
+        /* All too small, use the largest */
+        if (unlikely(nodes_empty(interleave_nodes)))
+                node_set(prefer, interleave_nodes);
 
-        if (do_set_mempolicy(MPOL_INTERLEAVE, &node_online_map))
+        if (do_set_mempolicy(MPOL_INTERLEAVE, &interleave_nodes))
                 printk("numa_policy_init: interleaving failed\n");
 }
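The 16MB cutoff in the patch above is a hard-coded literal. If a tunable were
preferred over debating the number, one possible (untested) sketch is to take
the threshold from the command line; the "mpol_interleave_min=" name and the
interleave_min variable are invented here and are not part of the patch:

static unsigned long interleave_min __initdata = 16 << 20;

static int __init setup_interleave_min(char *str)
{
        /* e.g. mpol_interleave_min=4M */
        interleave_min = memparse(str, &str);
        return 1;
}
__setup("mpol_interleave_min=", setup_interleave_min);

The size test in the loop would then compare (total_pages << PAGE_SHIFT)
against interleave_min instead of the (16 << 20) literal.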
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Andrew Morton @ 2007-06-08  1:01 UTC
To: Paul Mundt; +Cc: linux-mm, ak, clameter, hugh, lee.schermerhorn

On Thu, 7 Jun 2007 10:17:01 +0900 Paul Mundt <lethal@linux-sh.org> wrote:

> This is an alternative approach to the MPOL_INTERLEAVE across online
> nodes as the system init policy. Andi suggested it might be worthwhile
> trying to do this dynamically rather than as a command line option, so
> that's what this tries to do.
>
> With this, the online nodes are sized and packed into an interleave map
> if they're large enough for interleave to be worthwhile. I arbitrarily
> chose 16MB as the node size to enable interleaving, but perhaps someone
> has a better figure in mind?
>
> In the case where all of the nodes are smaller than that, the largest
> node is selected and placed into the map by itself (if they're all the
> same size, the first online node gets used).
>
> If people prefer this approach, the previous patch adding mpolinit can be
> dropped.
>
> Signed-off-by: Paul Mundt <lethal@linux-sh.org>

Well I took silence as assent.

None of the above text is suitable for a changelog. Please send a
changelog for this patch, thanks.
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Christoph Lameter @ 2007-06-08  2:47 UTC
To: Andrew Morton; +Cc: Paul Mundt, linux-mm, ak, hugh, lee.schermerhorn

On Thu, 7 Jun 2007, Andrew Morton wrote:

> Well I took silence as assent.

Well, grudgingly. How far are we willing to go to support these asymmetric
setups? The NUMA code was initially designed for mostly symmetric systems
with roughly the same amount of memory on each node. The farther we move
away from that, the more special casing we will have to add to deal with
these imbalances.

With memoryless nodes we already have one issue that will ripple through
the kernel, likely requiring numerous modifications and special casing.
Then we now have the ZONE_DMA issues reordering the zonelists. Now we will
support systems with 1MB size nodes? Will we need to modify the slab
allocators to only allocate on special processors?
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Andrew Morton @ 2007-06-08  3:01 UTC
To: Christoph Lameter; +Cc: Paul Mundt, linux-mm, ak, hugh, lee.schermerhorn

On Thu, 7 Jun 2007 19:47:09 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:

> On Thu, 7 Jun 2007, Andrew Morton wrote:
>
> > Well I took silence as assent.
>
> Well, grudgingly. How far are we willing to go to support these asymmetric
> setups? The NUMA code was initially designed for mostly symmetric systems
> with roughly the same amount of memory on each node. The farther we move
> away from that, the more special casing we will have to add to deal with
> these imbalances.
>
> With memoryless nodes we already have one issue that will ripple through
> the kernel, likely requiring numerous modifications and special casing.
> Then we now have the ZONE_DMA issues reordering the zonelists. Now we will
> support systems with 1MB size nodes? Will we need to modify the slab
> allocators to only allocate on special processors?
>

Failing to support memoryless nodes was a bug, and we should continue to
take bugfixes for that.

Dunno about the rest - it depends upon how real-world are the problems
which people hit, and upon how messy the fixes look.
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Christoph Lameter @ 2007-06-08  3:11 UTC
To: Andrew Morton; +Cc: Paul Mundt, linux-mm, ak, hugh, lee.schermerhorn

On Thu, 7 Jun 2007, Andrew Morton wrote:

> Failing to support memoryless nodes was a bug, and we should continue to
> take bugfixes for that.

We intentionally did support memoryless nodes on the arch level but not in
the core in order to avoid these issues.
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Paul Mundt @ 2007-06-08  3:25 UTC
To: Christoph Lameter
Cc: Andrew Morton, linux-mm, ak, hugh, lee.schermerhorn, mpm

On Thu, Jun 07, 2007 at 07:47:09PM -0700, Christoph Lameter wrote:
> On Thu, 7 Jun 2007, Andrew Morton wrote:
>
> > Well I took silence as assent.
>
> Well, grudgingly. How far are we willing to go to support these asymmetric
> setups? The NUMA code was initially designed for mostly symmetric systems
> with roughly the same amount of memory on each node. The farther we move
> away from that, the more special casing we will have to add to deal with
> these imbalances.
>
Well, this doesn't all have to be dynamic either. I opted for the
mpolinit= approach first so we wouldn't make the accounting for the
common case heavier, but certainly having it dynamic is less hassle. The
asymmetric case will likely be the common case for embedded, but it's
obviously possible to try to work that into SLOB or something similar,
if making SLUB or SLAB lighter weight and more tunable for these cases
ends up being a real barrier.

On the other hand, as we start having machines with multiple gigs of RAM
stashed in node 0 (with many smaller memories in other nodes), SLOB isn't
going to be a long-term option either.

The pgdat is already special cased for things like flatmem and memory
hotplug; throwing something similar to scheduler domains into the pgdat
for node behavioural hints might be the least intrusive approach (and
could be ifdefed out for symmetric nodes).

> With memoryless nodes we already have one issue that will ripple through
> the kernel, likely requiring numerous modifications and special casing.
> Then we now have the ZONE_DMA issues reordering the zonelists. Now we will
> support systems with 1MB size nodes? Will we need to modify the slab
> allocators to only allocate on special processors?
>
Unfortunately CONFIG_NUMA deals with all of the problems that embedded
systems with multiple memories have (albeit perhaps somewhat
heavy-handedly), so extending this seems to be a far more productive
approach than reinventing things. If we have to do this through a special
allocator for the asymmetric node case, so be it, but I don't expect the
problem to go away.

Even with just the mempolicy changes for dynamic interleave, a 128k or
512k node is already usable (despite slab and slub both chewing through a
good chunk of it).
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Christoph Lameter @ 2007-06-08  3:49 UTC
To: Paul Mundt; +Cc: Andrew Morton, linux-mm, ak, hugh, lee.schermerhorn, mpm

On Fri, 8 Jun 2007, Paul Mundt wrote:

> obviously possible to try to work that into SLOB or something similar,
> if making SLUB or SLAB lighter weight and more tunable for these cases
> ends up being a real barrier.

It's obviously possible, and as far as I can tell the architecture you have
there requires it to operate. But the question is how much special casing
we will have to add to the core VM.

We would likely have to add a

	slub_nodes=

parameter that allows the specification of a nodelist that is allowed for
the slab allocator. Then modify SLUB to use its own nodemap instead of
the node online map. Modify get_partial_node() to not try a node not in the
nodemap and go to get_any_partial() immediately. In addition to checking
cpuset_zone_allowed we would need to check the slab node list.

Hmm.... That would also help to create isolated nodes that have no memory
on them.

See what evil things you drive me to...

Could you try this patch (untested)? Set the allowed nodes on boot with

	slub_nodes=0

if you have only node 0 for SLUB.

---
 mm/slub.c |   69 ++++++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 63 insertions(+), 6 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2007-06-07 20:32:30.000000000 -0700
+++ linux-2.6/mm/slub.c	2007-06-07 20:48:19.000000000 -0700
@@ -270,6 +270,20 @@ static inline struct kmem_cache_node *ge
 #endif
 }
 
+#ifdef CONFIG_NUMA
+static nodemask_t slub_nodes = NODE_MASK_ALL;
+
+static inline int forbidden_node(int node)
+{
+        return !node_isset(node, slub_nodes);
+}
+#else
+static inline int forbidden_node(int node)
+{
+        return 0;
+}
+#endif
+
 static inline int check_valid_pointer(struct kmem_cache *s,
                                 struct page *page, const void *object)
 {
@@ -1242,8 +1256,12 @@ static struct page *get_any_partial(stru
                         ->node_zonelists[gfp_zone(flags)];
         for (z = zonelist->zones; *z; z++) {
                 struct kmem_cache_node *n;
+                int node = zone_to_nid(*z);
 
-                n = get_node(s, zone_to_nid(*z));
+                if (forbidden_node(node))
+                        continue;
+
+                n = get_node(s, node);
 
                 if (n && cpuset_zone_allowed_hardwall(*z, flags) &&
                                 n->nr_partial > MIN_PARTIAL) {
@@ -1261,10 +1279,12 @@ static struct page *get_any_partial(stru
  */
 static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node)
 {
-        struct page *page;
+        struct page *page = NULL;
         int searchnode = (node == -1) ? numa_node_id() : node;
 
-        page = get_partial_node(get_node(s, searchnode));
+        if (!forbidden_node(node))
+                page = get_partial_node(get_node(s, searchnode));
+
         if (page || (flags & __GFP_THISNODE))
                 return page;
 
@@ -1819,7 +1839,11 @@ static void free_kmem_cache_nodes(struct
         int node;
 
         for_each_online_node(node) {
-                struct kmem_cache_node *n = s->node[node];
+                struct kmem_cache_node *n;
+
+                if (forbidden_node(node))
+                        continue;
+                n = s->node[node];
                 if (n && n != &s->local_node)
                         kmem_cache_free(kmalloc_caches, n);
                 s->node[node] = NULL;
@@ -1839,6 +1863,9 @@ static int init_kmem_cache_nodes(struct
         for_each_online_node(node) {
                 struct kmem_cache_node *n;
 
+                if (forbidden_node(node))
+                        continue;
+
                 if (local_node == node)
                         n = &s->local_node;
                 else {
@@ -2092,7 +2119,12 @@ static int kmem_cache_close(struct kmem_
 
         /* Attempt to free all objects */
         for_each_online_node(node) {
-                struct kmem_cache_node *n = get_node(s, node);
+                struct kmem_cache_node *n;
+
+                if (forbidden_node(node))
+                        continue;
+
+                n = get_node(s, node);
 
                 n->nr_partial -= free_list(s, n, &n->partial);
                 if (atomic_long_read(&n->nr_slabs))
@@ -2167,6 +2199,17 @@ static int __init setup_slub_nomerge(cha
 
 __setup("slub_nomerge", setup_slub_nomerge);
 
+#ifdef CONFIG_NUMA
+static int __init setup_slub_nodes(char *str)
+{
+        if (*str == '=')
+                nodelist_parse(str + 1, slub_nodes);
+        return 1;
+}
+
+__setup("slub_nodes", setup_slub_nodes);
+#endif
+
 static struct kmem_cache *create_kmalloc_cache(struct kmem_cache *s,
                 const char *name, int size, gfp_t gfp_flags)
 {
@@ -2329,6 +2372,9 @@ int kmem_cache_shrink(struct kmem_cache
         flush_all(s);
         for_each_online_node(node) {
+                if (forbidden_node(node))
+                        continue;
+
                 n = get_node(s, node);
 
                 if (!n->nr_partial)
@@ -2755,7 +2801,12 @@ static unsigned long validate_slab_cache
         flush_all(s);
         for_each_online_node(node) {
-                struct kmem_cache_node *n = get_node(s, node);
+                struct kmem_cache_node *n;
+
+                if (forbidden_node(node))
+                        continue;
+
+                n = get_node(s, node);
 
                 count += validate_slab_node(s, n);
         }
@@ -2981,6 +3032,9 @@ static int list_locations(struct kmem_ca
                 unsigned long flags;
                 struct page *page;
 
+                if (forbidden_node(node))
+                        continue;
+
                 if (!atomic_read(&n->nr_slabs))
                         continue;
 
@@ -3104,6 +3158,9 @@ static unsigned long slab_objects(struct
         for_each_online_node(node) {
                 struct kmem_cache_node *n = get_node(s, node);
 
+                if (forbidden_node(node))
+                        continue;
+
                 if (flags & SO_PARTIAL) {
                         if (flags & SO_OBJECTS)
                                 x = count_partial(n);
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Paul Mundt @ 2007-06-08  4:13 UTC
To: Christoph Lameter
Cc: Andrew Morton, linux-mm, ak, hugh, lee.schermerhorn, mpm

On Thu, Jun 07, 2007 at 08:49:53PM -0700, Christoph Lameter wrote:
> On Fri, 8 Jun 2007, Paul Mundt wrote:
>
> > obviously possible to try to work that into SLOB or something similar,
> > if making SLUB or SLAB lighter weight and more tunable for these cases
> > ends up being a real barrier.
>
> Its obviously possible and as far as I can tell the architecture you have
> there requires it to operate. But the question is how much special casing
> we will have to add to the core VM.
>
> We would likely have to add a
>
> 	slub_nodes=
>
> parameter that allows the specification of a nodelist that is allowed for
> the slab allocator. Then modify SLUB to use its own nodemap instead of
> the node online map. Modify get_partial_node() to not try a node not in the
> nodemap and go to get_any_partial() immediately. In addition to checking
> cpuset_zone_allowed we would need to check the slab node list.
>
> Hmm.... That would also help to create isolated nodes that have no memory
> on them.
>
> See what evil things you drive me to...
>
> Could you try this patch (untested)? Set the allowed nodes on boot with
>
> 	slub_nodes=0
>
> if you have only node 0 for SLUB.
>
Yes, that works better (note that node 1 interleave is disabled in both
cases):

With patch:

/ # cat /sys/devices/system/node/node1/meminfo

Node 1 MemTotal:          128 kB
Node 1 MemFree:            72 kB
Node 1 MemUsed:            56 kB
Node 1 Active:              0 kB
Node 1 Inactive:            0 kB
Node 1 Dirty:               0 kB
Node 1 Writeback:           0 kB
Node 1 FilePages:           0 kB
Node 1 Mapped:              0 kB
Node 1 AnonPages:           0 kB
Node 1 PageTables:          0 kB
Node 1 NFS_Unstable:        0 kB
Node 1 Bounce:              0 kB
Node 1 Slab:                0 kB
Node 1 SReclaimable:        0 kB
Node 1 SUnreclaim:          0 kB
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0

[  117.216293] Node 0 Normal free:55900kB min:1016kB low:1268kB high:1524kB active:692kB inactive:536kB present:65024kB pages_scanned:0 all_unreclaimable? no
[  117.230029] lowmem_reserve[]: 0
[  117.233140] Node 1 Normal free:72kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:128kB pages_scanned:0 all_unreclaimable? no
[  117.245322] lowmem_reserve[]: 0
[  117.248434] Node 0 Normal: 1*4kB 5*8kB 3*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 1*2048kB 13*4096kB = 55900kB
[  117.259320] Node 1 Normal: 2*4kB 0*8kB 0*16kB 0*32kB 1*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 72kB

Without:

/ # cat /sys/devices/system/node/node1/meminfo

Node 1 MemTotal:          128 kB
Node 1 MemFree:            64 kB
Node 1 MemUsed:            64 kB
Node 1 Active:              0 kB
Node 1 Inactive:            0 kB
Node 1 Dirty:               0 kB
Node 1 Writeback:           0 kB
Node 1 FilePages:           0 kB
Node 1 Mapped:              0 kB
Node 1 AnonPages:           0 kB
Node 1 PageTables:          0 kB
Node 1 NFS_Unstable:        0 kB
Node 1 Bounce:              0 kB
Node 1 Slab:                8 kB
Node 1 SReclaimable:        0 kB
Node 1 SUnreclaim:          8 kB
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0

[   87.000717] Node 0 Normal free:55912kB min:1016kB low:1268kB high:1524kB active:668kB inactive:556kB present:65024kB pages_scanned:0 all_unreclaimable? no
[   87.014453] lowmem_reserve[]: 0
[   87.017565] Node 1 Normal free:64kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:128kB pages_scanned:0 all_unreclaimable? no
[   87.029746] lowmem_reserve[]: 0
[   87.032858] Node 0 Normal: 0*4kB 9*8kB 2*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 1*2048kB 13*4096kB = 55912kB
[   87.043744] Node 1 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 1*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 64kB

So at least that gets back the couple of slab pages!
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Christoph Lameter @ 2007-06-08  4:27 UTC
To: Paul Mundt; +Cc: Andrew Morton, linux-mm, ak, hugh, lee.schermerhorn, mpm

On Fri, 8 Jun 2007, Paul Mundt wrote:

> Node 1 SUnreclaim:          8 kB

> So at least that gets back the couple of slab pages!

Hmmmm.. is that worth it?

The patch is not right, btw. There is still the case that new_slab() can
acquire a page on the wrong node, and since we are not set up to allow
that node in SLUB we will crash.

This now gets a bit ugly. In order to avoid that situation we check
first if the node is allowed. If not then we simply ask for an alloc on
the first node.

But that may still make the page allocator fall back. If that happens then
we redo the allocation with GFP_THISNODE to force an allocation on the
first node or fail.

I think we could do better by constructing a custom zonelist, but that will
be even more special casing.

---
 mm/slub.c |   63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 58 insertions(+), 5 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2007-06-07 21:01:32.000000000 -0700
+++ linux-2.6/mm/slub.c	2007-06-07 21:23:04.000000000 -0700
@@ -215,6 +215,10 @@ static inline void ClearSlabDebug(struct
 
 static int kmem_size = sizeof(struct kmem_cache);
 
+#ifdef CONFIG_NUMA
+static nodemask_t slub_nodes = NODE_MASK_ALL;
+#endif
+
 #ifdef CONFIG_SMP
 static struct notifier_block slab_notifier;
 #endif
@@ -1023,6 +1027,11 @@ static struct page *new_slab(struct kmem
         if (flags & __GFP_WAIT)
                 local_irq_enable();
 
+        /* Hack: Just get the first node if the node is not allowed */
+        if (slab_state >= UP && !get_node(s, node))
+                node = first_node(slub_nodes);
+
+redo:
         page = allocate_slab(s, flags & GFP_LEVEL_MASK, node);
         if (!page)
                 goto out;
@@ -1030,6 +1039,27 @@ static struct page *new_slab(struct kmem
         n = get_node(s, page_to_nid(page));
         if (n)
                 atomic_long_inc(&n->nr_slabs);
+#ifdef CONFIG_NUMA
+        else {
+                if (slab_state >= UP) {
+                        /*
+                         * The baaad page allocator gave us a page on a
+                         * node that we should not use. Force a page on
+                         * a legit node or fail.
+                         */
+                        __free_pages(page, s->order);
+                        flags |= GFP_THISNODE;
+
+                        mod_zone_page_state(page_zone(page),
+                                (s->flags & SLAB_RECLAIM_ACCOUNT) ?
+                                NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
+                                - (1 << s->order));
+
+                        goto redo;
+                }
+        }
+#endif
+
         page->offset = s->offset / sizeof(void *);
         page->slab = s;
         page->flags |= 1 << PG_slab;
@@ -1261,10 +1291,13 @@ static struct page *get_any_partial(stru
  */
 static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node)
 {
-        struct page *page;
+        struct page *page = NULL;
         int searchnode = (node == -1) ? numa_node_id() : node;
+        struct kmem_cache_node *n = get_node(s, searchnode);
+
+        if (n)
+                page = get_partial_node(n);
 
-        page = get_partial_node(get_node(s, searchnode));
         if (page || (flags & __GFP_THISNODE))
                 return page;
 
@@ -1820,12 +1853,22 @@ static void free_kmem_cache_nodes(struct
 
         for_each_online_node(node) {
                 struct kmem_cache_node *n = s->node[node];
+
                 if (n && n != &s->local_node)
                         kmem_cache_free(kmalloc_caches, n);
                 s->node[node] = NULL;
         }
 }
 
+static int __init setup_slub_nodes(char *str)
+{
+        if (*str == '=')
+                nodelist_parse(str + 1, slub_nodes);
+        return 1;
+}
+
+__setup("slub_nodes", setup_slub_nodes);
+
 static int init_kmem_cache_nodes(struct kmem_cache *s, gfp_t gfpflags)
 {
         int node;
@@ -1839,6 +1882,9 @@ static int init_kmem_cache_nodes(struct
         for_each_online_node(node) {
                 struct kmem_cache_node *n;
 
+                if (!node_isset(node, slub_nodes))
+                        continue;
+
                 if (local_node == node)
                         n = &s->local_node;
                 else {
@@ -2094,6 +2140,9 @@ static int kmem_cache_close(struct kmem_
         for_each_online_node(node) {
                 struct kmem_cache_node *n = get_node(s, node);
 
+                if (!n)
+                        continue;
+
                 n->nr_partial -= free_list(s, n, &n->partial);
                 if (atomic_long_read(&n->nr_slabs))
                         return 1;
@@ -2331,7 +2380,7 @@ int kmem_cache_shrink(struct kmem_cache
         for_each_online_node(node) {
                 n = get_node(s, node);
 
-                if (!n->nr_partial)
+                if (!n || !n->nr_partial)
                         continue;
 
                 for (i = 0; i < s->objects; i++)
@@ -2757,7 +2806,8 @@ static unsigned long validate_slab_cache
         for_each_online_node(node) {
                 struct kmem_cache_node *n = get_node(s, node);
 
-                count += validate_slab_node(s, n);
+                if (n)
+                        count += validate_slab_node(s, n);
         }
         return count;
 }
@@ -2981,7 +3031,7 @@ static int list_locations(struct kmem_ca
                 unsigned long flags;
                 struct page *page;
 
-                if (!atomic_read(&n->nr_slabs))
+                if (!n || !atomic_read(&n->nr_slabs))
                         continue;
 
                 spin_lock_irqsave(&n->list_lock, flags);
@@ -3104,6 +3154,9 @@ static unsigned long slab_objects(struct
         for_each_online_node(node) {
                 struct kmem_cache_node *n = get_node(s, node);
 
+                if (!n)
+                        continue;
+
                 if (flags & SO_PARTIAL) {
                         if (flags & SO_OBJECTS)
                                 x = count_partial(n);
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Paul Mundt @ 2007-06-08  6:05 UTC
To: Christoph Lameter
Cc: Andrew Morton, linux-mm, ak, hugh, lee.schermerhorn, mpm

On Thu, Jun 07, 2007 at 09:27:01PM -0700, Christoph Lameter wrote:
> On Fri, 8 Jun 2007, Paul Mundt wrote:
>
> > Node 1 SUnreclaim:          8 kB
>
> > So at least that gets back the couple of slab pages!
>
> Hmmmm.. is that worth it? The patch is not right, btw. There is still the
> case that new_slab() can acquire a page on the wrong node, and since we are
> not set up to allow that node in SLUB we will crash.
>
Well, every page we can get back is a win in this situation, since we're
talking about individual pages being used by applications. The other 56k
is a bit more problematic, but that's something I'd like to narrow down
as well. I don't mind giving up a chunk of the node as long as the
majority of it is usable for applications, but certainly every page we
can get back helps.

> This now gets a bit ugly. In order to avoid that situation we check
> first if the node is allowed. If not then we simply ask for an alloc on
> the first node.
>
> But that may still make the page allocator fall back. If that happens then
> we redo the allocation with GFP_THISNODE to force an allocation on the
> first node or fail.
>
This patch works fine for the few cases I've tried, at least.

> I think we could do better by constructing a custom zonelist, but that will
> be even more special casing.
>
I don't know if a custom zonelist is worth the trouble. For the common
asymmetric case, you could at least infer that ZONE_NORMAL is the only
thing populated per node (well, small nodes other than node 0). If you
mean just creating the zonelist from the range of allowable SLUB nodes,
that could work.
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Christoph Lameter @ 2007-06-08  6:09 UTC
To: Paul Mundt; +Cc: Andrew Morton, linux-mm, ak, hugh, lee.schermerhorn, mpm

On Fri, 8 Jun 2007, Paul Mundt wrote:

> > I think we could do better by constructing a custom zonelist, but that will
> > be even more special casing.
> >
> I don't know if a custom zonelist is worth the trouble. For the common
> asymmetric case, you could at least infer that ZONE_NORMAL is the only
> thing populated per node (well, small nodes other than node 0). If you
> mean just creating the zonelist from the range of allowable SLUB nodes,
> that could work.

Well, that is quite difficult because of the other constraints on the alloc.
The allocation must consider the cpuset context and the memory policies of
the task (which may need special casing already there for interleave).
Maybe we can determine a zonelist from those restrictions, then kick out
the zones belonging to the illegal nodes from that zonelist, and pass that
to __alloc_pages() to perform the alloc.

Looks like we are heading for a new alloc function:

	alloc_pages_node_not_nodes(order, gfpmask, node, forbidden-nodes)

But maybe the hack of just going to node 0 on a problem is enough???
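For illustration only -- not from any patch in this thread -- a rough sketch
of what such a helper could look like if it simply walks the preferred node's
zonelist and skips the forbidden nodes, rather than building a custom
zonelist. The function name follows the suggestion above, and the cpuset and
mempolicy constraints being discussed are deliberately ignored:

static struct page *alloc_pages_node_not_nodes(gfp_t gfp, unsigned int order,
                                               int node, nodemask_t *forbidden)
{
        struct zonelist *zl = &NODE_DATA(node)->node_zonelists[gfp_zone(gfp)];
        struct zone **z;

        /* Walk the usual fallback order, but skip forbidden nodes. */
        for (z = zl->zones; *z; z++) {
                int nid = zone_to_nid(*z);
                struct page *page;

                if (node_isset(nid, *forbidden))
                        continue;

                page = alloc_pages_node(nid, gfp | __GFP_THISNODE, order);
                if (page)
                        return page;
        }

        return NULL;
}

Retrying node by node with __GFP_THISNODE is heavier than a single
__alloc_pages() call over a filtered zonelist, but it avoids having to
construct that zonelist in the first place.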
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Paul Mundt @ 2007-06-08  6:27 UTC
To: Christoph Lameter
Cc: Andrew Morton, linux-mm, ak, hugh, lee.schermerhorn, mpm

On Thu, Jun 07, 2007 at 11:09:48PM -0700, Christoph Lameter wrote:
> On Fri, 8 Jun 2007, Paul Mundt wrote:
>
> > > I think we could do better by constructing a custom zonelist, but that will
> > > be even more special casing.
> > >
> > I don't know if a custom zonelist is worth the trouble. For the common
> > asymmetric case, you could at least infer that ZONE_NORMAL is the only
> > thing populated per node (well, small nodes other than node 0). If you
> > mean just creating the zonelist from the range of allowable SLUB nodes,
> > that could work.
>
> Well, that is quite difficult because of the other constraints on the alloc.
> The allocation must consider the cpuset context and the memory policies of
> the task (which may need special casing already there for interleave).
> Maybe we can determine a zonelist from those restrictions, then kick out
> the zones belonging to the illegal nodes from that zonelist, and pass that
> to __alloc_pages() to perform the alloc.
>
> Looks like we are heading for a new alloc function:
>
> 	alloc_pages_node_not_nodes(order, gfpmask, node, forbidden-nodes)
>
> But maybe the hack of just going to node 0 on a problem is enough???

That depends on the policy; in the MPOL_BIND case we certainly don't want
to bleed out to node 0. For the general case, falling back on node 0 in
the event of a problem seems to be a reasonable compromise. In the longer
term, alloc_pages_not_nodes() may be worthwhile for the cases where
symmetric and asymmetric nodes are both present, without wanting to put
all of the pressure on node 0.

This is largely why I was leaning towards flags in the pgdat, suggesting
what the node is willing to put up with. It would be fairly trivial to
construct a map of allowable SLUB nodes and potentials for the zonelist
out of that. This still doesn't solve the problem of cpuset constraints,
though.

Incidentally, the interleave map created for mempol sysinit is something
that could also be picked up by SLUB for the allowable node map (at least
as a starting point, excluding cpuset constraints).
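Purely as an illustration of the "node behavioural hints" idea above -- the
node_hints field and these helpers do not exist anywhere; they only show the
shape such an interface might take:

/* Hypothetical per-node hint flags, stored in the pgdat */
#define NODE_HINT_SLAB          0x1     /* node may back slab/metadata allocations */
#define NODE_HINT_INTERLEAVE    0x2     /* node is large enough to interleave over */

static inline int node_allows_slab(int nid)
{
        /* node_hints is an invented pgdat field, for illustration only */
        return NODE_DATA(nid)->node_hints & NODE_HINT_SLAB;
}

static inline int node_allows_interleave(int nid)
{
        return NODE_DATA(nid)->node_hints & NODE_HINT_INTERLEAVE;
}

SLUB's forbidden_node() test and the dynamic interleave map from the original
patch could then both be driven from the same per-node hints, rather than
from separate boot parameters.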
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Christoph Lameter @ 2007-06-08  6:43 UTC
To: Paul Mundt; +Cc: Andrew Morton, linux-mm, ak, hugh, lee.schermerhorn, mpm

On Fri, 8 Jun 2007, Paul Mundt wrote:

> Incidentally, the interleave map created for mempol sysinit is something
> that could also be picked up by SLUB for the allowable node map (at least
> as a starting point, excluding cpuset constraints).

SLUB already uses that map on bootup through the page allocator, so for
boot you can actually restrict SLUB without any additional patches. The
problem is later, when the policy is set to MPOL_DEFAULT.

The key problem is that the node restrictions add an additional constraint
to the ones that SLUB already obeys.
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Matt Mackall @ 2007-06-08 14:50 UTC
To: Paul Mundt, Christoph Lameter, Andrew Morton, linux-mm, ak, hugh, lee.schermerhorn, Nick Piggin

On Fri, Jun 08, 2007 at 12:25:05PM +0900, Paul Mundt wrote:
> On Thu, Jun 07, 2007 at 07:47:09PM -0700, Christoph Lameter wrote:
> > On Thu, 7 Jun 2007, Andrew Morton wrote:
> >
> > > Well I took silence as assent.
> >
> > Well, grudgingly. How far are we willing to go to support these asymmetric
> > setups? The NUMA code was initially designed for mostly symmetric systems
> > with roughly the same amount of memory on each node. The farther we move
> > away from that, the more special casing we will have to add to deal with
> > these imbalances.
> >
> Well, this doesn't all have to be dynamic either. I opted for the
> mpolinit= approach first so we wouldn't make the accounting for the
> common case heavier, but certainly having it dynamic is less hassle. The
> asymmetric case will likely be the common case for embedded, but it's
> obviously possible to try to work that into SLOB or something similar,
> if making SLUB or SLAB lighter weight and more tunable for these cases
> ends up being a real barrier.
>
> On the other hand, as we start having machines with multiple gigs of RAM
> that are stashed in node 0 (with many smaller memories in other nodes),
> SLOB isn't going to be a long-term option either.

SLOB in -mm should scale to this size reasonably well now, and Nick
and I have another tweak planned that should make it quite fast here.

SLOB's big scalability problem at this point is number of CPUs.
Throwing some fine-grained locking at it or the like may be able to
help with that too.

Why would you even want to bother making it scale that large? For
starters, it's less affected by things like dcache fragmentation. The
majority of pages pinned by long-lived dcache entries will still be
available to other allocations.

Haven't given any thought to NUMA yet though..

--
Mathematics is the supreme nostalgia of our time.
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Nick Piggin @ 2007-06-12  2:36 UTC
To: Matt Mackall
Cc: Paul Mundt, Christoph Lameter, Andrew Morton, linux-mm, ak, hugh, lee.schermerhorn

Matt Mackall wrote:
> On Fri, Jun 08, 2007 at 12:25:05PM +0900, Paul Mundt wrote:
>
>>Well, this doesn't all have to be dynamic either. I opted for the
>>mpolinit= approach first so we wouldn't make the accounting for the
>>common case heavier, but certainly having it dynamic is less hassle. The
>>asymmetric case will likely be the common case for embedded, but it's
>>obviously possible to try to work that into SLOB or something similar,
>>if making SLUB or SLAB lighter weight and more tunable for these cases
>>ends up being a real barrier.
>>
>>On the other hand, as we start having machines with multiple gigs of RAM
>>that are stashed in node 0 (with many smaller memories in other nodes),
>>SLOB isn't going to be a long-term option either.
>
> SLOB in -mm should scale to this size reasonably well now, and Nick
> and I have another tweak planned that should make it quite fast here.

Indeed. The existing code in -mm should hopefully get merged next cycle,
so if you have ever wanted to use SLOB but had performance problems,
please reevaluate and report if you still hit problems. Even on small
SMPs, it might be a reasonable choice, although it won't be able to match
the other allocators for performance.

Again, if you have problems with SMP scalability of SLOB, then please let
us know too, because as Matt said there are a few things we could do
(such as multiple freelists) which may improve performance quite a bit
without hurting complexity or memory usage much.

--
SUSE Labs, Novell Inc.
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Paul Mundt @ 2007-06-12  9:43 UTC
To: Matt Mackall
Cc: Christoph Lameter, Andrew Morton, linux-mm, ak, hugh, lee.schermerhorn, Nick Piggin

On Fri, Jun 08, 2007 at 09:50:11AM -0500, Matt Mackall wrote:
> SLOB's big scalability problem at this point is number of CPUs.
> Throwing some fine-grained locking at it or the like may be able to
> help with that too.
>
> Why would you even want to bother making it scale that large? For
> starters, it's less affected by things like dcache fragmentation. The
> majority of pages pinned by long-lived dcache entries will still be
> available to other allocations.
>
> Haven't given any thought to NUMA yet though..
>
This is what I've hacked together and tested with my small nodes. It's
not terribly intelligent, and it pushes off most of the logic to the page
allocator. Obviously it's not terribly scalable, and I haven't tested it
with page migration, either. Still, it works for me with my simple tmpfs
+ mpol policy tests.

Tested on a UP + SPARSEMEM (static, not extreme) + NUMA (2 nodes) + SLOB
configuration.

Flame away!

Signed-off-by: Paul Mundt <lethal@linux-sh.org>

--

 include/linux/slab.h |    7 ++++
 mm/slob.c            |   80 ++++++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 73 insertions(+), 14 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index a015236..efc87c1 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -200,6 +200,13 @@ static inline void *__kmalloc_node(size_t size, gfp_t flags, int node)
 {
         return __kmalloc(size, flags);
 }
+#elif defined(CONFIG_SLOB)
+extern void *__kmalloc_node(size_t size, gfp_t flags, int node);
+
+static inline void *kmalloc_node(size_t size, gfp_t flags, int node)
+{
+        return __kmalloc_node(size, flags, node);
+}
 #endif /* !CONFIG_NUMA */
 
 /*
diff --git a/mm/slob.c b/mm/slob.c
index 71976c5..48af24c 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -74,7 +74,7 @@ static void slob_free(void *b, int size);
 
 static void slob_timer_cbk(void);
 
-static void *slob_alloc(size_t size, gfp_t gfp, int align)
+static void *slob_alloc(size_t size, gfp_t gfp, int align, int node)
 {
         slob_t *prev, *cur, *aligned = 0;
         int delta = 0, units = SLOB_UNITS(size);
@@ -111,12 +111,19 @@ static void *slob_alloc(size_t size, gfp_t gfp, int align)
                         return cur;
                 }
                 if (cur == slobfree) {
+                        void *pages;
+
                         spin_unlock_irqrestore(&slob_lock, flags);
 
                         if (size == PAGE_SIZE) /* trying to shrink arena? */
                                 return 0;
 
-                        cur = (slob_t *)__get_free_page(gfp);
+                        if (node == -1)
+                                pages = alloc_pages(gfp, 0);
+                        else
+                                pages = alloc_pages_node(node, gfp, 0);
+
+                        cur = page_address(pages);
                         if (!cur)
                                 return 0;
 
@@ -161,23 +168,29 @@ static void slob_free(void *block, int size)
         spin_unlock_irqrestore(&slob_lock, flags);
 }
 
-void *__kmalloc(size_t size, gfp_t gfp)
+static void *__kmalloc_alloc(size_t size, gfp_t gfp, int node)
 {
         slob_t *m;
         bigblock_t *bb;
         unsigned long flags;
+        void *page;
 
         if (size < PAGE_SIZE - SLOB_UNIT) {
-                m = slob_alloc(size + SLOB_UNIT, gfp, 0);
+                m = slob_alloc(size + SLOB_UNIT, gfp, 0, node);
                 return m ? (void *)(m + 1) : 0;
         }
 
-        bb = slob_alloc(sizeof(bigblock_t), gfp, 0);
+        bb = slob_alloc(sizeof(bigblock_t), gfp, 0, node);
         if (!bb)
                 return 0;
 
         bb->order = get_order(size);
-        bb->pages = (void *)__get_free_pages(gfp, bb->order);
+        if (node == -1)
+                page = alloc_pages(gfp, bb->order);
+        else
+                page = alloc_pages_node(node, gfp, bb->order);
+
+        bb->pages = page_address(page);
 
         if (bb->pages) {
                 spin_lock_irqsave(&block_lock, flags);
@@ -190,8 +203,21 @@ void *__kmalloc(size_t size, gfp_t gfp)
         slob_free(bb, sizeof(bigblock_t));
         return 0;
 }
+
+void *__kmalloc(size_t size, gfp_t gfp)
+{
+        return __kmalloc_alloc(size, gfp, -1);
+}
 EXPORT_SYMBOL(__kmalloc);
 
+#ifdef CONFIG_NUMA
+void *__kmalloc_node(size_t size, gfp_t gfp, int node)
+{
+        return __kmalloc_alloc(size, gfp, node);
+}
+EXPORT_SYMBOL(__kmalloc_node);
+#endif
+
 /**
  * krealloc - reallocate memory. The contents will remain unchanged.
  *
@@ -289,7 +315,7 @@ struct kmem_cache *kmem_cache_create(const char *name, size_t size,
 {
         struct kmem_cache *c;
 
-        c = slob_alloc(sizeof(struct kmem_cache), flags, 0);
+        c = slob_alloc(sizeof(struct kmem_cache), flags, 0, -1);
 
         if (c) {
                 c->name = name;
@@ -317,22 +343,44 @@ void kmem_cache_destroy(struct kmem_cache *c)
 }
 EXPORT_SYMBOL(kmem_cache_destroy);
 
-void *kmem_cache_alloc(struct kmem_cache *c, gfp_t flags)
+static void *__kmem_cache_alloc(struct kmem_cache *c, gfp_t flags, int node)
 {
         void *b;
 
         if (c->size < PAGE_SIZE)
-                b = slob_alloc(c->size, flags, c->align);
-        else
-                b = (void *)__get_free_pages(flags, get_order(c->size));
+                b = slob_alloc(c->size, flags, c->align, node);
+        else {
+                void *pages;
+
+                if (node == -1)
+                        pages = alloc_pages(flags, get_order(c->size));
+                else
+                        pages = alloc_pages_node(node, flags,
+                                                 get_order(c->size));
+
+                b = page_address(pages);
+        }
 
         if (c->ctor)
                 c->ctor(b, c, 0);
 
         return b;
 }
+
+void *kmem_cache_alloc(struct kmem_cache *c, gfp_t flags)
+{
+        return __kmem_cache_alloc(c, flags, -1);
+}
 EXPORT_SYMBOL(kmem_cache_alloc);
 
+#ifdef CONFIG_NUMA
+void *kmem_cache_alloc_node(struct kmem_cache *c, gfp_t flags, int node)
+{
+        return __kmem_cache_alloc(c, flags, node);
+}
+EXPORT_SYMBOL(kmem_cache_alloc_node);
+#endif
+
 void *kmem_cache_zalloc(struct kmem_cache *c, gfp_t flags)
 {
         void *ret = kmem_cache_alloc(c, flags);
@@ -406,10 +454,14 @@ void __init kmem_cache_init(void)
 
 static void slob_timer_cbk(void)
 {
-        void *p = slob_alloc(PAGE_SIZE, 0, PAGE_SIZE-1);
+        int node;
+
+        for_each_online_node(node) {
+                void *p = slob_alloc(PAGE_SIZE, 0, PAGE_SIZE-1, node);
 
-        if (p)
-                free_page((unsigned long)p);
+                if (p)
+                        free_page((unsigned long)p);
+        }
 
         mod_timer(&slob_timer, jiffies + HZ);
 }
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Matt Mackall @ 2007-06-12 15:32 UTC
To: Paul Mundt, Christoph Lameter, Andrew Morton, linux-mm, ak, hugh, lee.schermerhorn, Nick Piggin

On Tue, Jun 12, 2007 at 06:43:59PM +0900, Paul Mundt wrote:
> On Fri, Jun 08, 2007 at 09:50:11AM -0500, Matt Mackall wrote:
> > SLOB's big scalability problem at this point is number of CPUs.
> > Throwing some fine-grained locking at it or the like may be able to
> > help with that too.
> >
> > Why would you even want to bother making it scale that large? For
> > starters, it's less affected by things like dcache fragmentation. The
> > majority of pages pinned by long-lived dcache entries will still be
> > available to other allocations.
> >
> > Haven't given any thought to NUMA yet though..
> >
> This is what I've hacked together and tested with my small nodes. It's
> not terribly intelligent, and it pushes off most of the logic to the page
> allocator. Obviously it's not terribly scalable, and I haven't tested it
> with page migration, either. Still, it works for me with my simple tmpfs
> + mpol policy tests.
>
> Tested on a UP + SPARSEMEM (static, not extreme) + NUMA (2 nodes) + SLOB
> configuration.
>
> Flame away!

For starters, it's not against the current SLOB, which no longer has
the bigblock list.

> -void *__kmalloc(size_t size, gfp_t gfp)
> +static void *__kmalloc_alloc(size_t size, gfp_t gfp, int node)

That's a ridiculous name. So, uh.. more underbars!

Though really, I think you can just name it __kmalloc_node?

> +        if (node == -1)
> +                pages = alloc_pages(flags, get_order(c->size));
> +        else
> +                pages = alloc_pages_node(node, flags,
> +                                         get_order(c->size));

This fragment appears a few times. Looks like it ought to get its own
function. And that function can reduce to a trivial inline in the
!NUMA case.

> +void *kmem_cache_alloc_node(struct kmem_cache *c, gfp_t flags, int node)
> +{
> +        return __kmem_cache_alloc(c, flags, node);
> +}

If we make the underlying functions all take a node, this stuff all
gets simpler.

> static void slob_timer_cbk(void)

This is gone in the latest SLOB too.

--
Mathematics is the supreme nostalgia of our time.
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Nick Piggin @ 2007-06-13  2:10 UTC
To: Matt Mackall
Cc: Paul Mundt, Christoph Lameter, Andrew Morton, linux-mm, ak, hugh, lee.schermerhorn

Matt Mackall wrote:
> On Tue, Jun 12, 2007 at 06:43:59PM +0900, Paul Mundt wrote:
>
>>On Fri, Jun 08, 2007 at 09:50:11AM -0500, Matt Mackall wrote:
>>
>>>SLOB's big scalability problem at this point is number of CPUs.
>>>Throwing some fine-grained locking at it or the like may be able to
>>>help with that too.
>>>
>>>Why would you even want to bother making it scale that large? For
>>>starters, it's less affected by things like dcache fragmentation. The
>>>majority of pages pinned by long-lived dcache entries will still be
>>>available to other allocations.
>>>
>>>Haven't given any thought to NUMA yet though..
>>>
>>
>>This is what I've hacked together and tested with my small nodes. It's
>>not terribly intelligent, and it pushes off most of the logic to the page
>>allocator. Obviously it's not terribly scalable, and I haven't tested it
>>with page migration, either. Still, it works for me with my simple tmpfs
>>+ mpol policy tests.
>>
>>Tested on a UP + SPARSEMEM (static, not extreme) + NUMA (2 nodes) + SLOB
>>configuration.
>>
>>Flame away!
>
> For starters, it's not against the current SLOB, which no longer has
> the bigblock list.
>
>>-void *__kmalloc(size_t size, gfp_t gfp)
>>+static void *__kmalloc_alloc(size_t size, gfp_t gfp, int node)
>
> That's a ridiculous name. So, uh.. more underbars!
>
> Though really, I think you can just name it __kmalloc_node?
>
>>+        if (node == -1)
>>+                pages = alloc_pages(flags, get_order(c->size));
>>+        else
>>+                pages = alloc_pages_node(node, flags,
>>+                                         get_order(c->size));
>
> This fragment appears a few times. Looks like it ought to get its own
> function. And that function can reduce to a trivial inline in the
> !NUMA case.

BTW, what I would like to see tried initially -- which may give reasonable
scalability and NUMAness -- is perhaps per-cpu or per-node free page
lists. However, these lists would not be exclusively per-cpu, because that
would result in worse memory consumption (we should always try to put
memory consumption above all else with SLOB).

So each list would have its own lock and can be accessed by any CPU, but
each CPU would default to its own list first (or in the case of a
kmalloc_node, it could default to some other list).

Then we'd probably like to introduce a *little* bit of slack, so that we
will allocate a new page on our local list even if there is a small amount
of memory free on another list. I think this might be enough to get a
reasonable number of list-local allocations without blowing out the memory
usage much. The slack ratio could be configurable, so at one extreme we
could always allocate from our local lists for best NUMA placement, I guess.

I haven't given it a great deal of thought, so this strategy might go
horribly wrong in some cases... but I have a feeling something reasonably
simple like that might go a long way to improving locking scalability and
NUMAness.

--
SUSE Labs, Novell Inc.
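To make that concrete, a hypothetical sketch of the per-node list and the
list-picking decision -- slob_nodelist, free_units and slob_slack are all
invented names, and nothing like this exists in SLOB today:

struct slob_nodelist {
        spinlock_t lock;
        slob_t *freelist;
        unsigned long free_units;       /* SLOB units currently free on this list */
};

static struct slob_nodelist slob_lists[MAX_NUMNODES];
static int slob_slack = 8;              /* remote list must have this many spare units */

/* Decide which node's list should service an allocation of 'units'. */
static int slob_pick_list(int local, int units)
{
        int nid;

        /* Prefer the local list if it plausibly has room. */
        if (slob_lists[local].free_units >= units)
                return local;

        /* Steal from another list only if it has units to spare... */
        for_each_online_node(nid)
                if (nid != local &&
                    slob_lists[nid].free_units >= units + slob_slack)
                        return nid;

        /* ...otherwise grow the local list with a fresh page. */
        return local;
}

Each list would still take its own lock around the actual freelist walk, so
any CPU can fall back to any list, which preserves the memory-consumption
property described above.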
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Matt Mackall @ 2007-06-13  3:12 UTC
To: Nick Piggin
Cc: Paul Mundt, Christoph Lameter, Andrew Morton, linux-mm, ak, hugh, lee.schermerhorn

On Wed, Jun 13, 2007 at 12:10:21PM +1000, Nick Piggin wrote:
> Matt Mackall wrote:
> >On Tue, Jun 12, 2007 at 06:43:59PM +0900, Paul Mundt wrote:
> >
> >>This is what I've hacked together and tested with my small nodes. It's
> >>not terribly intelligent, and it pushes off most of the logic to the page
> >>allocator. Obviously it's not terribly scalable, and I haven't tested it
> >>with page migration, either. Still, it works for me with my simple tmpfs
> >>+ mpol policy tests.
> >>
> >>Tested on a UP + SPARSEMEM (static, not extreme) + NUMA (2 nodes) + SLOB
> >>configuration.
> >>
> >>Flame away!
> >
> >For starters, it's not against the current SLOB, which no longer has
> >the bigblock list.
> >
> >>-void *__kmalloc(size_t size, gfp_t gfp)
> >>+static void *__kmalloc_alloc(size_t size, gfp_t gfp, int node)
> >
> >That's a ridiculous name. So, uh.. more underbars!
> >
> >Though really, I think you can just name it __kmalloc_node?
> >
> >>+        if (node == -1)
> >>+                pages = alloc_pages(flags, get_order(c->size));
> >>+        else
> >>+                pages = alloc_pages_node(node, flags,
> >>+                                         get_order(c->size));
> >
> >This fragment appears a few times. Looks like it ought to get its own
> >function. And that function can reduce to a trivial inline in the
> >!NUMA case.
>
> BTW. what I would like to see tried initially -- which may give reasonable
> scalability and NUMAness -- is perhaps a percpu or per-node free pages
> lists. However these lists would not be exclusively per-cpu, because that
> would result in worse memory consumption (we should always try to put
> memory consumption above all else with SLOB).
>
> So each list would have its own lock and can be accessed by any CPU, but
> they would default to their own list first (or in the case of a
> kmalloc_node, they could default to some other list).
>
> Then we'd probably like to introduce a *little* bit of slack, so that we
> will allocate a new page on our local list even if there is a small amount
> of memory free on another list. I think this might be enough to get a
> reasonable number of list-local allocations without blowing out the memory
> usage much. The slack ratio could be configurable so at one extreme we
> could always allocate from our local lists for best NUMA placement I guess.
>
> I haven't given it a great deal of thought, so this strategy might go
> horribly wrong in some cases... but I have a feeling something reasonably
> simple like that might go a long way to improving locking scalability and
> NUMAness.

It's an interesting problem.

There's a fair amount more we can do to get performance up on SMP which
should probably happen before we think too much about NUMA.

--
Mathematics is the supreme nostalgia of our time.
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Paul Mundt @ 2007-06-13  2:53 UTC
To: Matt Mackall
Cc: Christoph Lameter, Andrew Morton, linux-mm, ak, hugh, lee.schermerhorn, Nick Piggin

On Tue, Jun 12, 2007 at 10:32:34AM -0500, Matt Mackall wrote:
> On Tue, Jun 12, 2007 at 06:43:59PM +0900, Paul Mundt wrote:
> > On Fri, Jun 08, 2007 at 09:50:11AM -0500, Matt Mackall wrote:
> > > Haven't given any thought to NUMA yet though..
> > >
> > This is what I've hacked together and tested with my small nodes. It's
> > not terribly intelligent, and it pushes off most of the logic to the page
> > allocator. Obviously it's not terribly scalable, and I haven't tested it
> > with page migration, either. Still, it works for me with my simple tmpfs
> > + mpol policy tests.
> >
> > Tested on a UP + SPARSEMEM (static, not extreme) + NUMA (2 nodes) + SLOB
> > configuration.
> >
> > Flame away!
>
> For starters, it's not against the current SLOB, which no longer has
> the bigblock list.
>
Sorry about that, seems I used the wrong tree.

> > -void *__kmalloc(size_t size, gfp_t gfp)
> > +static void *__kmalloc_alloc(size_t size, gfp_t gfp, int node)
>
> That's a ridiculous name. So, uh.. more underbars!
>
Agreed, though I couldn't think of a better one.

> Though really, I think you can just name it __kmalloc_node?
>
No, kmalloc_node and __kmalloc_node are both required by CONFIG_NUMA,
otherwise that would have been the logical choice.

> > +        if (node == -1)
> > +                pages = alloc_pages(flags, get_order(c->size));
> > +        else
> > +                pages = alloc_pages_node(node, flags,
> > +                                         get_order(c->size));
>
> This fragment appears a few times. Looks like it ought to get its own
> function. And that function can reduce to a trivial inline in the
> !NUMA case.
>
Ok.

> > +void *kmem_cache_alloc_node(struct kmem_cache *c, gfp_t flags, int node)
> > +{
> > +        return __kmem_cache_alloc(c, flags, node);
> > +}
>
> If we make the underlying functions all take a node, this stuff all
> gets simpler.
>
Could you elaborate on that? We only require the node specifier in the
allocation path, and this simply hands it down in accordance with the
existing APIs. After allocation time the node id is encoded in the page
flags, so we can easily figure out which node a page is tied to.

I'll post the updated patch separately.
* Re: [PATCH] numa: mempolicy: dynamic interleave map for system init.
From: Matt Mackall @ 2007-06-13  3:16 UTC
To: Paul Mundt, Christoph Lameter, Andrew Morton, linux-mm, ak, hugh, lee.schermerhorn, Nick Piggin

On Wed, Jun 13, 2007 at 11:53:37AM +0900, Paul Mundt wrote:
> On Tue, Jun 12, 2007 at 10:32:34AM -0500, Matt Mackall wrote:
> > On Tue, Jun 12, 2007 at 06:43:59PM +0900, Paul Mundt wrote:
> > > On Fri, Jun 08, 2007 at 09:50:11AM -0500, Matt Mackall wrote:
> > > > Haven't given any thought to NUMA yet though..
> > > >
> > > This is what I've hacked together and tested with my small nodes. It's
> > > not terribly intelligent, and it pushes off most of the logic to the page
> > > allocator. Obviously it's not terribly scalable, and I haven't tested it
> > > with page migration, either. Still, it works for me with my simple tmpfs
> > > + mpol policy tests.
> > >
> > > Tested on a UP + SPARSEMEM (static, not extreme) + NUMA (2 nodes) + SLOB
> > > configuration.
> > >
> > > Flame away!
> >
> > For starters, it's not against the current SLOB, which no longer has
> > the bigblock list.
> >
> Sorry about that, seems I used the wrong tree.
>
> > > -void *__kmalloc(size_t size, gfp_t gfp)
> > > +static void *__kmalloc_alloc(size_t size, gfp_t gfp, int node)
> >
> > That's a ridiculous name. So, uh.. more underbars!
> >
> Agreed, though I couldn't think of a better one.
>
> > Though really, I think you can just name it __kmalloc_node?
> >
> No, kmalloc_node and __kmalloc_node are both required by CONFIG_NUMA,
> otherwise that would have been the logical choice.

What I'm suggesting is: _always_ have __kmalloc_node and have __kmalloc
be a trivial inline that calls it. Together with cleaning up the
following piece, it may compile down to what we currently have on UP/SMP:

> > > +        if (node == -1)
> > > +                pages = alloc_pages(flags, get_order(c->size));
> > > +        else
> > > +                pages = alloc_pages_node(node, flags,
> > > +                                         get_order(c->size));
> >
> > This fragment appears a few times. Looks like it ought to get its own
> > function. And that function can reduce to a trivial inline in the
> > !NUMA case.
> >
> Ok.
>
> > > +void *kmem_cache_alloc_node(struct kmem_cache *c, gfp_t flags, int node)
> > > +{
> > > +        return __kmem_cache_alloc(c, flags, node);
> > > +}
> >
> > If we make the underlying functions all take a node, this stuff all
> > gets simpler.
> >
> Could you elaborate on that?

See above. Just make the non-node versions wrappers around the node
versions everywhere.

--
Mathematics is the supreme nostalgia of our time.
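In other words, roughly this shape, using the names from Paul's patch (the
slob_new_pages() helper name is made up here):

/* Collapses to a plain alloc_pages() on !NUMA builds. */
static inline struct page *slob_new_pages(gfp_t gfp, int order, int node)
{
#ifdef CONFIG_NUMA
        if (node != -1)
                return alloc_pages_node(node, gfp, order);
#endif
        return alloc_pages(gfp, order);
}

/* The non-node entry points become one-line wrappers... */
void *__kmalloc(size_t size, gfp_t gfp)
{
        return __kmalloc_node(size, gfp, -1);
}

void *kmem_cache_alloc(struct kmem_cache *c, gfp_t flags)
{
        return kmem_cache_alloc_node(c, flags, -1);
}

/* ...and only the *_node() variants do the real work. */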