linux-mm.kvack.org archive mirror
* Re: [patch 10/12] Memoryless nodes: Update memory policy and page migration
       [not found] ` <20070711182252.138829364@sgi.com>
@ 2007-07-11 18:46   ` Nishanth Aravamudan
  2007-07-11 18:56     ` Christoph Lameter
  0 siblings, 1 reply; 48+ messages in thread
From: Nishanth Aravamudan @ 2007-07-11 18:46 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, kxr, linux-mm, Lee Schermerhorn, KAMEZAWA Hiroyuki

On 11.07.2007 [11:22:29 -0700], Christoph Lameter wrote:
> Online nodes now may have no memory. The checks and initialization must therefore
> be changed to no longer use the online functions.
> 
> This will correctly initialize the interleave on bootup to only target
> nodes with memory and will make sys_move_pages return an error when a page
> is to be moved to a memoryless node. Similarly we will get an error if
> MPOL_BIND or MPOL_INTERLEAVE is used on a memoryless node.
> 
> These are somewhat new semantics. So far one could specify memoryless nodes
> and we would maybe do the right thing and just ignore the node (or we'd do
> something strange like with MPOL_INTERLEAVE). If we want to allow the
> specification of memoryless nodes via memory policies then we need to keep
> checking for online nodes.
> 
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
> 
> ---
>  mm/mempolicy.c |   10 +++++-----
>  mm/migrate.c   |    2 +-
>  2 files changed, 6 insertions(+), 6 deletions(-)
> 
> Index: linux-2.6.22-rc6-mm1/mm/migrate.c
> ===================================================================
> --- linux-2.6.22-rc6-mm1.orig/mm/migrate.c	2007-07-09 21:23:18.000000000 -0700
> +++ linux-2.6.22-rc6-mm1/mm/migrate.c	2007-07-11 10:37:03.000000000 -0700
> @@ -963,7 +963,7 @@ asmlinkage long sys_move_pages(pid_t pid
>  				goto out;
> 
>  			err = -ENODEV;
> -			if (!node_online(node))
> +			if (!node_memory(node))

			if (!node_state(node, N_MEMORY))

?

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [patch 10/12] Memoryless nodes: Update memory policy and page migration
  2007-07-11 18:46   ` [patch 10/12] Memoryless nodes: Update memory policy and page migration Nishanth Aravamudan
@ 2007-07-11 18:56     ` Christoph Lameter
  0 siblings, 0 replies; 48+ messages in thread
From: Christoph Lameter @ 2007-07-11 18:56 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: akpm, kxr, linux-mm, Lee Schermerhorn, KAMEZAWA Hiroyuki

On Wed, 11 Jul 2007, Nishanth Aravamudan wrote:

> > Index: linux-2.6.22-rc6-mm1/mm/migrate.c
> > ===================================================================
> > --- linux-2.6.22-rc6-mm1.orig/mm/migrate.c	2007-07-09 21:23:18.000000000 -0700
> > +++ linux-2.6.22-rc6-mm1/mm/migrate.c	2007-07-11 10:37:03.000000000 -0700
> > @@ -963,7 +963,7 @@ asmlinkage long sys_move_pages(pid_t pid
> >  				goto out;
> > 
> >  			err = -ENODEV;
> > -			if (!node_online(node))
> > +			if (!node_memory(node))
> 
> 			if (!node_state(node, N_MEMORY))
> 
> ?

Next patch fixes it up: :-=

Fixed up version


Memoryless nodes: Update memory policy and page migration

Online nodes now may have no memory. The checks and initialization must therefore
be changed to no longer use the online functions.

This will correctly initialize the interleave on bootup to only target
nodes with memory and will make sys_move_pages return an error when a page
is to be moved to a memoryless node. Similarly we will get an error if
MPOL_BIND or MPOL_INTERLEAVE is used on a memoryless node.

These are somewhat new semantics. So far one could specify memoryless nodes
and we would maybe do the right thing and just ignore the node (or we'd do
something strange like with MPOL_INTERLEAVE). If we want to allow the
specification of memoryless nodes via memory policies then we need to keep
checking for online nodes.
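
To illustrate the new semantics, here is a minimal userspace sketch of the check, not the kernel code itself. The 64-bit mask and the helper check_move_target() are assumptions made up for this example; only node_state(), node_set_state() and the N_* names come from the patches in this thread.

```c
#include <errno.h>

/* Userspace model of the node state masks; not kernel code. */
enum node_states { N_POSSIBLE, N_ONLINE, N_MEMORY, NR_NODE_STATES };

/* One bit per node, so this model handles at most 64 nodes. */
static unsigned long long node_states[NR_NODE_STATES];

static void node_set_state(int node, enum node_states state)
{
	node_states[state] |= 1ULL << node;
}

static int node_state(int node, enum node_states state)
{
	return (int)((node_states[state] >> node) & 1ULL);
}

/*
 * Hypothetical helper mirroring the check in sys_move_pages():
 * a target node must have memory, being online is not enough.
 * The caller is assumed to have range-checked the node number.
 */
static int check_move_target(int node)
{
	if (!node_state(node, N_MEMORY))
		return -ENODEV;
	return 0;
}
```

With node 0 online and populated and node 1 online but memoryless, check_move_target(0) returns 0 while check_move_target(1) returns -ENODEV, where the old node_online() test would have accepted both.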

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>

---
 mm/mempolicy.c |   10 +++++-----
 mm/migrate.c   |    2 +-
 2 files changed, 6 insertions(+), 6 deletions(-)

Index: linux-2.6.22-rc6-mm1/mm/migrate.c
===================================================================
--- linux-2.6.22-rc6-mm1.orig/mm/migrate.c	2007-07-11 11:49:33.000000000 -0700
+++ linux-2.6.22-rc6-mm1/mm/migrate.c	2007-07-11 11:51:34.000000000 -0700
@@ -963,7 +963,7 @@ asmlinkage long sys_move_pages(pid_t pid
 				goto out;
 
 			err = -ENODEV;
-			if (!node_online(node))
+			if (!node_state(node, N_MEMORY))
 				goto out;
 
 			err = -EACCES;
Index: linux-2.6.22-rc6-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.22-rc6-mm1.orig/mm/mempolicy.c	2007-07-11 11:49:39.000000000 -0700
+++ linux-2.6.22-rc6-mm1/mm/mempolicy.c	2007-07-11 11:49:48.000000000 -0700
@@ -496,9 +496,9 @@ static void get_zonemask(struct mempolic
 		*nodes = p->v.nodes;
 		break;
 	case MPOL_PREFERRED:
-		/* or use current node instead of online map? */
+		/* or use current node instead of memory_map? */
 		if (p->v.preferred_node < 0)
-			*nodes = node_online_map;
+			*nodes = node_states[N_MEMORY];
 		else
 			node_set(p->v.preferred_node, *nodes);
 		break;
@@ -1618,7 +1618,7 @@ void __init numa_policy_init(void)
 	 * fall back to the largest node if they're all smaller.
 	 */
 	nodes_clear(interleave_nodes);
-	for_each_online_node(nid) {
+	for_each_node_state(nid, N_MEMORY) {
 		unsigned long total_pages = node_present_pages(nid);
 
 		/* Preserve the largest node */
@@ -1898,7 +1898,7 @@ int show_numa_map(struct seq_file *m, vo
 		seq_printf(m, " huge");
 	} else {
 		check_pgd_range(vma, vma->vm_start, vma->vm_end,
-				&node_online_map, MPOL_MF_STATS, md);
+				&node_states[N_MEMORY], MPOL_MF_STATS, md);
 	}
 
 	if (!md->pages)
@@ -1925,7 +1925,7 @@ int show_numa_map(struct seq_file *m, vo
 	if (md->writeback)
 		seq_printf(m," writeback=%lu", md->writeback);
 
-	for_each_online_node(n)
+	for_each_node_state(n, N_MEMORY)
 		if (md->node[n])
 			seq_printf(m, " N%d=%lu", n, md->node[n]);
 out:


* Re: [patch 11/12] Add N_CPU node state
       [not found] ` <20070711182252.376540447@sgi.com>
@ 2007-07-11 19:04   ` Christoph Lameter
  0 siblings, 0 replies; 48+ messages in thread
From: Christoph Lameter @ 2007-07-11 19:04 UTC (permalink / raw)
  To: akpm
  Cc: kxr, linux-mm, Nishanth Aravamudan, Lee Schermerhorn, KAMEZAWA Hiroyuki

On Wed, 11 Jul 2007, Christoph Lameter wrote:

> Index: linux-2.6.22-rc6-mm1/mm/migrate.c
> ===================================================================
> --- linux-2.6.22-rc6-mm1.orig/mm/migrate.c	2007-07-11 10:39:28.000000000 -0700
> +++ linux-2.6.22-rc6-mm1/mm/migrate.c	2007-07-11 10:39:38.000000000 -0700
> @@ -963,7 +963,7 @@ asmlinkage long sys_move_pages(pid_t pid
>  				goto out;
>  
>  			err = -ENODEV;
> -			if (!node_memory(node))
> +			if (!node_state(node, N_MEMORY))
>  				goto out;
>  

That hunk papers over the last patch and the first patch. Here is the patch without those two chunks:


Add N_CPU node state

Zone reclaim needs to check whether a node has cpus. Zone reclaim will not
reclaim from a remote node if that node has a cpu of its own.
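
As a rough illustration of that rule, the decision can be modelled in userspace like this. It is only a sketch: the 64-bit mask and the helper may_reclaim_zone() are assumptions invented for the example, and the local node is passed in as a parameter instead of coming from numa_node_id().

```c
/* Userspace model of the node state masks; not kernel code. */
enum node_states { N_POSSIBLE, N_ONLINE, N_MEMORY, N_CPU, NR_NODE_STATES };

/* One bit per node, so this model handles at most 64 nodes. */
static unsigned long long node_states[NR_NODE_STATES];

static void node_set_state(int node, enum node_states state)
{
	node_states[state] |= 1ULL << node;
}

static int node_state(int node, enum node_states state)
{
	return (int)((node_states[state] >> node) & 1ULL);
}

/*
 * Hypothetical helper: returns 1 if reclaim may run on a zone that
 * belongs to zone_node when the caller runs on local_node.
 * A remote node that has cpus of its own is left alone.
 */
static int may_reclaim_zone(int zone_node, int local_node)
{
	if (node_state(zone_node, N_CPU) && zone_node != local_node)
		return 0;
	return 1;
}
```

So a cpu on node 0 may reclaim from node 0 itself and from a cpuless remote node, but not from a remote node that has cpus.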

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/nodemask.h |    1 +
 mm/page_alloc.c          |    4 +++-
 mm/vmscan.c              |    4 +---
 3 files changed, 5 insertions(+), 4 deletions(-)

Index: linux-2.6.22-rc6-mm1/include/linux/nodemask.h
===================================================================
--- linux-2.6.22-rc6-mm1.orig/include/linux/nodemask.h	2007-07-11 12:00:29.000000000 -0700
+++ linux-2.6.22-rc6-mm1/include/linux/nodemask.h	2007-07-11 12:01:10.000000000 -0700
@@ -344,6 +344,7 @@ enum node_states {
 	N_POSSIBLE,	/* The node could become online at some point */
 	N_ONLINE,	/* The node is online */
 	N_MEMORY,	/* The node has memory */
+	N_CPU,		/* The node has cpus */
 	NR_NODE_STATES
 };
 
Index: linux-2.6.22-rc6-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.22-rc6-mm1.orig/mm/vmscan.c	2007-07-11 12:00:45.000000000 -0700
+++ linux-2.6.22-rc6-mm1/mm/vmscan.c	2007-07-11 12:01:10.000000000 -0700
@@ -1851,7 +1851,6 @@ static int __zone_reclaim(struct zone *z
 
 int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 {
-	cpumask_t mask;
 	int node_id;
 
 	/*
@@ -1888,8 +1887,7 @@ int zone_reclaim(struct zone *zone, gfp_
 	 * as wide as possible.
 	 */
 	node_id = zone_to_nid(zone);
-	mask = node_to_cpumask(node_id);
-	if (!cpus_empty(mask) && node_id != numa_node_id())
+	if (node_state(node_id, N_CPU) && node_id != numa_node_id())
 		return 0;
 	return __zone_reclaim(zone, gfp_mask, order);
 }
Index: linux-2.6.22-rc6-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.22-rc6-mm1.orig/mm/page_alloc.c	2007-07-11 12:00:29.000000000 -0700
+++ linux-2.6.22-rc6-mm1/mm/page_alloc.c	2007-07-11 12:01:10.000000000 -0700
@@ -2728,6 +2728,7 @@ static struct per_cpu_pageset boot_pages
 static int __cpuinit process_zones(int cpu)
 {
 	struct zone *zone, *dzone;
+	int node = cpu_to_node(cpu);
 
 	for_each_zone(zone) {
 
@@ -2735,7 +2736,7 @@ static int __cpuinit process_zones(int c
 			continue;
 
 		zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset),
-					 GFP_KERNEL, cpu_to_node(cpu));
+					 GFP_KERNEL, node);
 		if (!zone_pcp(zone, cpu))
 			goto bad;
 
@@ -2746,6 +2747,7 @@ static int __cpuinit process_zones(int c
 			 	(zone->present_pages / percpu_pagelist_fraction));
 	}
 
+	node_set_state(node, N_CPU);
 	return 0;
 bad:
 	for_each_zone(dzone) {


* Re: [patch 01/12] NUMA: Generic management of nodemasks for various purposes
       [not found] ` <20070711182250.005856256@sgi.com>
@ 2007-07-11 19:06   ` Christoph Lameter
  2007-07-11 19:32     ` Lee Schermerhorn
                       ` (4 more replies)
  0 siblings, 5 replies; 48+ messages in thread
From: Christoph Lameter @ 2007-07-11 19:06 UTC (permalink / raw)
  To: akpm
  Cc: kxr, linux-mm, Nishanth Aravamudan, Lee Schermerhorn, KAMEZAWA Hiroyuki

On Wed, 11 Jul 2007, Christoph Lameter wrote:

> -EXPORT_SYMBOL(node_possible_map);
> +nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
> +	[N_POSSIBLE] => NODE_MASK_ALL,
> +	[N_ONLINE] =>{ { [0] = 1UL } }
> +};
> +EXPORT_SYMBOL(node_states);

Crap here too. I desperately need a vacation. Next week....


NUMA: Generic management of nodemasks for various purposes

Provide a generic way to keep nodemasks describing various characteristics
of NUMA nodes.

Remove the node_online_map and the node_possible_map and realize the whole
thing using two node states: N_POSSIBLE and N_ONLINE.
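
The idea can be sketched outside the kernel as one mask per state with a handful of accessors. This is a simplified model under stated assumptions: a single 64-bit word stands in for nodemask_t, and MAX_NODES is a made-up limit for the example. The accessor names follow the patch below.

```c
/* Userspace model only: one 64-bit word per state instead of nodemask_t. */
#define MAX_NODES 64

enum node_states {
	N_POSSIBLE,	/* The node could become online at some point */
	N_ONLINE,	/* The node is online */
	NR_NODE_STATES
};

static unsigned long long node_states[NR_NODE_STATES];

static int node_state(int node, enum node_states state)
{
	return (int)((node_states[state] >> node) & 1ULL);
}

static void node_set_state(int node, enum node_states state)
{
	node_states[state] |= 1ULL << node;
}

static void node_clear_state(int node, enum node_states state)
{
	node_states[state] &= ~(1ULL << node);
}

/* Count the nodes that currently have the given state. */
static int num_node_state(enum node_states state)
{
	int node, count = 0;

	for (node = 0; node < MAX_NODES; node++)
		count += node_state(node, state);
	return count;
}
```

Adding a further characteristic then only takes a new enum entry before NR_NODE_STATES; the accessors stay the same.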

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/nodemask.h |   87 ++++++++++++++++++++++++++++++++++++++---------
 mm/page_alloc.c          |   13 +++----
 2 files changed, 78 insertions(+), 22 deletions(-)

Index: linux-2.6.22-rc6-mm1/include/linux/nodemask.h
===================================================================
--- linux-2.6.22-rc6-mm1.orig/include/linux/nodemask.h	2007-07-11 11:31:30.000000000 -0700
+++ linux-2.6.22-rc6-mm1/include/linux/nodemask.h	2007-07-11 11:59:08.000000000 -0700
@@ -338,31 +338,81 @@ static inline void __nodes_remap(nodemas
 #endif /* MAX_NUMNODES */
 
 /*
+ * Bitmasks that are kept for all the nodes.
+ */
+enum node_states {
+	N_POSSIBLE,	/* The node could become online at some point */
+	N_ONLINE,	/* The node is online */
+	NR_NODE_STATES
+};
+
+/*
  * The following particular system nodemasks and operations
  * on them manage all possible and online nodes.
  */
 
-extern nodemask_t node_online_map;
-extern nodemask_t node_possible_map;
+extern nodemask_t node_states[NR_NODE_STATES];
 
 #if MAX_NUMNODES > 1
-#define num_online_nodes()	nodes_weight(node_online_map)
-#define num_possible_nodes()	nodes_weight(node_possible_map)
-#define node_online(node)	node_isset((node), node_online_map)
-#define node_possible(node)	node_isset((node), node_possible_map)
-#define first_online_node	first_node(node_online_map)
-#define next_online_node(nid)	next_node((nid), node_online_map)
+static inline int node_state(int node, enum node_states state)
+{
+	return node_isset(node, node_states[state]);
+}
+
+static inline void node_set_state(int node, enum node_states state)
+{
+	__node_set(node, &node_states[state]);
+}
+
+static inline void node_clear_state(int node, enum node_states state)
+{
+	__node_clear(node, &node_states[state]);
+}
+
+static inline int num_node_state(enum node_states state)
+{
+	return nodes_weight(node_states[state]);
+}
+
+#define for_each_node_state(__node, __state) \
+	for_each_node_mask((__node), node_states[__state])
+
+#define first_online_node	first_node(node_states[N_ONLINE])
+#define next_online_node(nid)	next_node((nid), node_states[N_ONLINE])
+
 extern int nr_node_ids;
 #else
-#define num_online_nodes()	1
-#define num_possible_nodes()	1
-#define node_online(node)	((node) == 0)
-#define node_possible(node)	((node) == 0)
+
+static inline int node_state(int node, enum node_states state)
+{
+	return node == 0;
+}
+
+static inline void node_set_state(int node, enum node_states state)
+{
+}
+
+static inline void node_clear_state(int node, enum node_states state)
+{
+}
+
+static inline int num_node_state(enum node_states state)
+{
+	return 1;
+}
+
+#define for_each_node_state(node, __state) \
+	for ( (node) = 0; (node) == 0; (node) = 1)
+
 #define first_online_node	0
 #define next_online_node(nid)	(MAX_NUMNODES)
 #define nr_node_ids		1
+
 #endif
 
+#define node_online_map 	node_states[N_ONLINE]
+#define node_possible_map 	node_states[N_POSSIBLE]
+
 #define any_online_node(mask)			\
 ({						\
 	int node;				\
@@ -372,10 +422,15 @@ extern int nr_node_ids;
 	node;					\
 })
 
-#define node_set_online(node)	   set_bit((node), node_online_map.bits)
-#define node_set_offline(node)	   clear_bit((node), node_online_map.bits)
+#define num_online_nodes()	num_node_state(N_ONLINE)
+#define num_possible_nodes()	num_node_state(N_POSSIBLE)
+#define node_online(node)	node_state((node), N_ONLINE)
+#define node_possible(node)	node_state((node), N_POSSIBLE)
+
+#define node_set_online(node)	   node_set_state((node), N_ONLINE)
+#define node_set_offline(node)	   node_clear_state((node), N_ONLINE)
 
-#define for_each_node(node)	   for_each_node_mask((node), node_possible_map)
-#define for_each_online_node(node) for_each_node_mask((node), node_online_map)
+#define for_each_node(node)	   for_each_node_state(node, N_POSSIBLE)
+#define for_each_online_node(node) for_each_node_state(node, N_ONLINE)
 
 #endif /* __LINUX_NODEMASK_H */
Index: linux-2.6.22-rc6-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.22-rc6-mm1.orig/mm/page_alloc.c	2007-07-11 11:49:34.000000000 -0700
+++ linux-2.6.22-rc6-mm1/mm/page_alloc.c	2007-07-11 11:59:50.000000000 -0700
@@ -47,13 +47,14 @@
 #include "internal.h"
 
 /*
- * MCD - HACK: Find somewhere to initialize this EARLY, or make this
- * initializer cleaner
+ * Array of node states.
  */
-nodemask_t node_online_map __read_mostly = { { [0] = 1UL } };
-EXPORT_SYMBOL(node_online_map);
-nodemask_t node_possible_map __read_mostly = NODE_MASK_ALL;
-EXPORT_SYMBOL(node_possible_map);
+nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
+	[N_POSSIBLE] = NODE_MASK_ALL,
+	[N_ONLINE] = { { [0] = 1UL } }
+};
+EXPORT_SYMBOL(node_states);
+
 unsigned long totalram_pages __read_mostly;
 unsigned long totalreserve_pages __read_mostly;
 long nr_swap_pages;


* Re: [patch 01/12] NUMA: Generic management of nodemasks for various purposes
  2007-07-11 19:06   ` [patch 01/12] NUMA: Generic management of nodemasks for various purposes Christoph Lameter
@ 2007-07-11 19:32     ` Lee Schermerhorn
  2007-07-20 20:49     ` [PATCH] Memoryless nodes: use "node_memory_map" for cpuset mems_allowed validation Lee Schermerhorn
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 48+ messages in thread
From: Lee Schermerhorn @ 2007-07-11 19:32 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, kxr, linux-mm, Nishanth Aravamudan, KAMEZAWA Hiroyuki

On Wed, 2007-07-11 at 12:06 -0700, Christoph Lameter wrote:
> On Wed, 11 Jul 2007, Christoph Lameter wrote:
> 
> > -EXPORT_SYMBOL(node_possible_map);
> > +nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
> > +	[N_POSSIBLE] => NODE_MASK_ALL,
> > +	[N_ONLINE] =>{ { [0] = 1UL } }
> > +};
> > +EXPORT_SYMBOL(node_states);
> 
> Crap here too. I desperately need a vacation. Next week....

Hi, Christoph.

I've grabbed your patch set [trying to keep track of updates ;-)].  I'll
test on various platforms here over the next couple of days and let you
know what I find.

Lee


* Re: [patch 07/12] Memoryless nodes: SLUB support
       [not found] ` <20070711182251.433134748@sgi.com>
@ 2007-07-12  0:07   ` Andrew Morton
  2007-07-12  1:42     ` Christoph Lameter
  0 siblings, 1 reply; 48+ messages in thread
From: Andrew Morton @ 2007-07-12  0:07 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: kxr, linux-mm, Nishanth Aravamudan, Lee Schermerhorn, KAMEZAWA Hiroyuki

On Wed, 11 Jul 2007 11:22:26 -0700
Christoph Lameter <clameter@sgi.com> wrote:

> Simply switch all for_each_online_node to for_each_memory_node. That way
> SLUB only operates on nodes with memory. Any allocation attempt on a
> memoryless node will fail, whereupon SLUB will fetch memory from a nearby
> node (depending on how memory policies and cpuset describe fallback).
> 

This is as far as I got when a reject storm hit.

> -	for_each_online_node(node)
> +	for_each_node_state(node, N_MEMORY)
>  		__kmem_cache_shrink(s, get_node(s, node), scratch);

I can find no sign of any __kmem_cache_shrink's anywhere.

Let's park all this until post-merge-window please.  Generally, now is not
a good time for me to be merging 2.6.24 stuff.


* Re: [patch 07/12] Memoryless nodes: SLUB support
  2007-07-12  0:07   ` [patch 07/12] Memoryless nodes: SLUB support Andrew Morton
@ 2007-07-12  1:42     ` Christoph Lameter
  2007-07-12 18:33       ` Nishanth Aravamudan
  0 siblings, 1 reply; 48+ messages in thread
From: Christoph Lameter @ 2007-07-12  1:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kxr, linux-mm, Nishanth Aravamudan, Lee Schermerhorn, KAMEZAWA Hiroyuki

On Wed, 11 Jul 2007, Andrew Morton wrote:

> This is as far as I got when a reject storm hit.
> 
> > -	for_each_online_node(node)
> > +	for_each_node_state(node, N_MEMORY)
> >  		__kmem_cache_shrink(s, get_node(s, node), scratch);
> 
> I can find no sign of any __kmem_cache_shrink's anywhere.

Yup I expected slab defrag to be merged first before you get to this.
 
> Let's park all this until post-merge-window please.  Generally, now is not
> a good time for me to be merging 2.6.24 stuff.

For SGI this is not important at all since we have no memoryless nodes. 

However, these fixes are important for other NUMA users. I think this 
needs to go into 2.6.23 for correctness' sake. We may have some fun with 
it since the fixed up behavior of GFP_THISNODE may expose additional 
problems in how subsystems handle memoryless nodes (and I do not have 
such a system). There are also patches against hugetlb that use this 
functionality here.

Necessary for asymmetric NUMA configs to work right.


Here is the patch rediffed before slab defrag.


Memoryless nodes: SLUB support

Simply switch all for_each_online_node to for_each_memory_node. That way
SLUB only operates on nodes with memory. Any allocation attempt on a
memoryless node will fail, whereupon SLUB will fetch memory from a nearby
node (depending on how memory policies and cpuset describe fallback).
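
A toy model of the "nearby node" fallback, for illustration only. In the kernel the fallback order comes from zonelists, memory policies and cpusets; the distance table, the has_memory[] array and nearest_memory_node() below are all assumptions invented for this sketch.

```c
#define NR_NODES 4

/* Hypothetical topology: nodes 1 and 3 are memoryless. */
static const int has_memory[NR_NODES] = { 1, 0, 1, 0 };

/* Hypothetical SLIT-style distances, 10 means local. */
static const int distance[NR_NODES][NR_NODES] = {
	{ 10, 20, 30, 40 },
	{ 20, 10, 20, 30 },
	{ 30, 20, 10, 20 },
	{ 40, 30, 20, 10 },
};

/*
 * Return the closest node that has memory, or -1 if there is none.
 * On a tie the lowest-numbered node wins.
 */
static int nearest_memory_node(int node)
{
	int n, best = -1;

	for (n = 0; n < NR_NODES; n++) {
		if (!has_memory[n])
			continue;
		if (best < 0 || distance[node][n] < distance[node][best])
			best = n;
	}
	return best;
}
```

A request on memoryless node 3 would be served from node 2 in this model, while a request on node 0 stays local.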

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/slub.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

Index: linux-2.6.22-rc6-mm1/mm/slub.c
===================================================================
--- linux-2.6.22-rc6-mm1.orig/mm/slub.c	2007-07-11 18:31:55.000000000 -0700
+++ linux-2.6.22-rc6-mm1/mm/slub.c	2007-07-11 18:33:27.000000000 -0700
@@ -1914,7 +1914,7 @@ static void free_kmem_cache_nodes(struct
 {
 	int node;
 
-	for_each_online_node(node) {
+	for_each_node_state(node, N_MEMORY) {
 		struct kmem_cache_node *n = s->node[node];
 		if (n && n != &s->local_node)
 			kmem_cache_free(kmalloc_caches, n);
@@ -1932,7 +1932,7 @@ static int init_kmem_cache_nodes(struct 
 	else
 		local_node = 0;
 
-	for_each_online_node(node) {
+	for_each_node_state(node, N_MEMORY) {
 		struct kmem_cache_node *n;
 
 		if (local_node == node)
@@ -2185,7 +2185,7 @@ static inline int kmem_cache_close(struc
 	flush_all(s);
 
 	/* Attempt to free all objects */
-	for_each_online_node(node) {
+	for_each_node_state(node, N_MEMORY) {
 		struct kmem_cache_node *n = get_node(s, node);
 
 		n->nr_partial -= free_list(s, n, &n->partial);
@@ -2480,7 +2480,7 @@ int kmem_cache_shrink(struct kmem_cache 
 		return -ENOMEM;
 
 	flush_all(s);
-	for_each_online_node(node) {
+	for_each_node_state(node, N_MEMORY) {
 		n = get_node(s, node);
 
 		if (!n->nr_partial)
@@ -2886,7 +2886,7 @@ static long validate_slab_cache(struct k
 		return -ENOMEM;
 
 	flush_all(s);
-	for_each_online_node(node) {
+	for_each_node_state(node, N_MEMORY) {
 		struct kmem_cache_node *n = get_node(s, node);
 
 		count += validate_slab_node(s, n, map);
@@ -3106,7 +3106,7 @@ static int list_locations(struct kmem_ca
 	/* Push back cpu slabs */
 	flush_all(s);
 
-	for_each_online_node(node) {
+	for_each_node_state(node, N_MEMORY) {
 		struct kmem_cache_node *n = get_node(s, node);
 		unsigned long flags;
 		struct page *page;
@@ -3233,7 +3233,7 @@ static unsigned long slab_objects(struct
 		}
 	}
 
-	for_each_online_node(node) {
+	for_each_node_state(node, N_MEMORY) {
 		struct kmem_cache_node *n = get_node(s, node);
 
 		if (flags & SO_PARTIAL) {
@@ -3261,7 +3261,7 @@ static unsigned long slab_objects(struct
 
 	x = sprintf(buf, "%lu", total);
 #ifdef CONFIG_NUMA
-	for_each_online_node(node)
+	for_each_node_state(node, N_MEMORY)
 		if (nodes[node])
 			x += sprintf(buf + x, " N%d=%lu",
 					node, nodes[node]);


* Re: [patch 07/12] Memoryless nodes: SLUB support
  2007-07-12  1:42     ` Christoph Lameter
@ 2007-07-12 18:33       ` Nishanth Aravamudan
  2007-07-12 18:38         ` Christoph Lameter
  0 siblings, 1 reply; 48+ messages in thread
From: Nishanth Aravamudan @ 2007-07-12 18:33 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, kxr, linux-mm, Lee Schermerhorn, KAMEZAWA Hiroyuki

On 11.07.2007 [18:42:52 -0700], Christoph Lameter wrote:
> On Wed, 11 Jul 2007, Andrew Morton wrote:
> 
> > This is as far as I got when a reject storm hit.
> > 
> > > -	for_each_online_node(node)
> > > +	for_each_node_state(node, N_MEMORY)
> > >  		__kmem_cache_shrink(s, get_node(s, node), scratch);
> > 
> > I can find no sign of any __kmem_cache_shrink's anywhere.
> 
> Yup I expected slab defrag to be merged first before you get to this.
> 
> > Let's park all this until post-merge-window please.  Generally, now
> > is not a good time for me to be merging 2.6.24 stuff.
> 
> For SGI this is not important at all since we have no memoryless
> nodes. 

Right, the original problem that brought this up again was a power
machine with two empty nodes displaying incorrect interleaving of
hugepages.

> However, these fixes are important for other NUMA users. I think this
> needs to go into 2.6.23 for correctness' sake. We may have some fun
> with it since the fixed up behavior of GFP_THISNODE may expose
> additional problems in how subsystems handle memoryless nodes (and I
> do not have such a system). There are also patches against hugetlb
> that use this functionality here.

I was waiting for this series to stabilize a bit before rebasing my
patch to fix the hugetlb interleaving with memoryless nodes. I also have
two patches on top of that which add a per-node sysfs nr_hugepages
attribute and also depend on the patch to make THISNODE allocations
stay on the current node from this series.

> Necessary for asymmetric NUMA configs to work right.
> 
> 
> Here is the patch rediffed before slab defrag.
> 
> 
> Memoryless nodes: SLUB support
> 
> Simply switch all for_each_online_node to for_each_memory_node. That way
> SLUB only operates on nodes with memory. Any allocation attempt on a
> memoryless node will fail, whereupon SLUB will fetch memory from a nearby
> node (depending on how memory policies and cpuset describe fallback).

This description is out of date. There is no for_each_memory_node() any
more, I think you meant for_each_node_state().

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center


* Re: [patch 07/12] Memoryless nodes: SLUB support
  2007-07-12 18:33       ` Nishanth Aravamudan
@ 2007-07-12 18:38         ` Christoph Lameter
  0 siblings, 0 replies; 48+ messages in thread
From: Christoph Lameter @ 2007-07-12 18:38 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Andrew Morton, kxr, linux-mm, Lee Schermerhorn, KAMEZAWA Hiroyuki

On Thu, 12 Jul 2007, Nishanth Aravamudan wrote:

> This description is out of date. There is no for_each_memory_node() any
> more, I think you meant for_each_node_state().

Correct.


* Re: [patch 00/12] NUMA: Memoryless node support V3
       [not found] <20070711182219.234782227@sgi.com>
                   ` (3 preceding siblings ...)
       [not found] ` <20070711182251.433134748@sgi.com>
@ 2007-07-13 15:14 ` Nishanth Aravamudan
  2007-07-13 16:43   ` Christoph Lameter
  4 siblings, 1 reply; 48+ messages in thread
From: Nishanth Aravamudan @ 2007-07-13 15:14 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, kxr, linux-mm, Lee Schermerhorn, KAMEZAWA Hiroyuki

On 11.07.2007 [11:22:19 -0700], Christoph Lameter wrote:
> Changes V2->V3:
> - Refresh patches (sigh)
> - Add comments suggested by Kamezawa Hiroyuki
> - Add signoff by Jes Sorensen

Christoph, would it be possible to get the current patches up on
kernel.org in your people-space? That way I know I have the current
versions of these, including any fixlets that come by?

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center


* Re: [patch 00/12] NUMA: Memoryless node support V3
  2007-07-13 15:14 ` [patch 00/12] NUMA: Memoryless node support V3 Nishanth Aravamudan
@ 2007-07-13 16:43   ` Christoph Lameter
  2007-07-13 16:52     ` Nishanth Aravamudan
                       ` (2 more replies)
  0 siblings, 3 replies; 48+ messages in thread
From: Christoph Lameter @ 2007-07-13 16:43 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Nishanth Aravamudan, akpm, kxr, linux-mm, KAMEZAWA Hiroyuki

On Fri, 13 Jul 2007, Nishanth Aravamudan wrote:

> On 11.07.2007 [11:22:19 -0700], Christoph Lameter wrote:
> > Changes V2->V3:
> > - Refresh patches (sigh)
> > - Add comments suggested by Kamezawa Hiroyuki
> > - Add signoff by Jes Sorensen
> 
> Christoph, would it be possible to get the current patches up on
> kernel.org in your people-space? That way I know I have the current
> versions of these, including any fixlets that come by?

Lee: Would you repost the patches after testing them and fixing them up? 

You probably have somewhere to publish them? I will be on vacation next 
week (and yes I will leave my laptop at home, somehow I have to get back 
my sanity).


* Re: [patch 00/12] NUMA: Memoryless node support V3
  2007-07-13 16:43   ` Christoph Lameter
@ 2007-07-13 16:52     ` Nishanth Aravamudan
  2007-07-13 17:20     ` Lee Schermerhorn
       [not found]     ` <1185310277.5649.90.camel@localhost>
  2 siblings, 0 replies; 48+ messages in thread
From: Nishanth Aravamudan @ 2007-07-13 16:52 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Lee Schermerhorn, akpm, kxr, linux-mm, KAMEZAWA Hiroyuki

On 13.07.2007 [09:43:25 -0700], Christoph Lameter wrote:
> On Fri, 13 Jul 2007, Nishanth Aravamudan wrote:
> 
> > On 11.07.2007 [11:22:19 -0700], Christoph Lameter wrote:
> > > Changes V2->V3:
> > > - Refresh patches (sigh)
> > > - Add comments suggested by Kamezawa Hiroyuki
> > > - Add signoff by Jes Sorensen
> > 
> > Christoph, would it be possible to get the current patches up on
> > kernel.org in your people-space? That way I know I have the current
> > versions of these, including any fixlets that come by?
> 
> Lee: Would you repost the patches after testing them and fixing them up? 

That will work too.

> You probably have somewhere to publish them? I will be on vacation
> next week (and yes I will leave my laptop at home, somehow I have to
> get back my sanity).

Enjoy your vacation and good luck with the sanity :) Thanks again for
working through these memoryless node issues.

-Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center


* Re: [patch 00/12] NUMA: Memoryless node support V3
  2007-07-13 16:43   ` Christoph Lameter
  2007-07-13 16:52     ` Nishanth Aravamudan
@ 2007-07-13 17:20     ` Lee Schermerhorn
  2007-07-13 17:23       ` Christoph Lameter
       [not found]     ` <1185310277.5649.90.camel@localhost>
  2 siblings, 1 reply; 48+ messages in thread
From: Lee Schermerhorn @ 2007-07-13 17:20 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nishanth Aravamudan, akpm, kxr, linux-mm, KAMEZAWA Hiroyuki

On Fri, 2007-07-13 at 09:43 -0700, Christoph Lameter wrote:
> On Fri, 13 Jul 2007, Nishanth Aravamudan wrote:
> 
> > On 11.07.2007 [11:22:19 -0700], Christoph Lameter wrote:
> > > Changes V2->V3:
> > > - Refresh patches (sigh)
> > > - Add comments suggested by Kamezawa Hiroyuki
> > > - Add signoff by Jes Sorensen
> > 
> > Christoph, would it be possible to get the current patches up on
> > kernel.org in your people-space? That way I know I have the current
> > versions of these, including any fixlets that come by?
> 
> Lee: Would you repost the patches after testing them and fixing them up? 

I'm up to my eyeballs right now, setting up a large system for testing
VM scalability with Oracle.  I hope to have time early next week to test
your patches.  In a mail exchange between you and Andrew, you mentioned
that your memoryless-node patches are atop your slab defrag?  Shall I
test them that way?  Or try to rebase against the then current -mm tree?
I.e., what's the probability that the slab defrag patches make it into
-mm before the memoryless node patches?

> 
> You probably have somewhere to publish them? I will be on vacation next 
> week (and yes I will leave my laptop at home, somehow I have to get back 
> my sanity).

You mean in addition to posting?  I can stick a copy on my
free.linux.hp.com http site.

Lee


* Re: [patch 00/12] NUMA: Memoryless node support V3
  2007-07-13 17:20     ` Lee Schermerhorn
@ 2007-07-13 17:23       ` Christoph Lameter
  2007-07-13 19:22         ` Lee Schermerhorn
  2007-07-13 20:53         ` Lee Schermerhorn
  0 siblings, 2 replies; 48+ messages in thread
From: Christoph Lameter @ 2007-07-13 17:23 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Nishanth Aravamudan, akpm, kxr, linux-mm, KAMEZAWA Hiroyuki

On Fri, 13 Jul 2007, Lee Schermerhorn wrote:

> I'm up to my eyeballs right now, setting up a large system for testing
> VM scalability with Oracle.  I hope to have time early next week to test
> your patches.  In a mail exchange between you and Andrew, you mentioned
> that your memoryless-node patches are atop your slab defrag?  Shall I
> test them that way?  Or try to rebase against the then current -mm tree?

You can skip the slab defrag. I posted a rediffed patch in my response 
to Andrew. Use that one.

> I.e., what's the probability that the slab defrag patches make it into
> -mm before the memoryless node patches?

No idea. Use the patch that does not rely on slab defrag.

> > You probably have somewhere to publish them? I will be on vacation next 
> > week (and yes I will leave my laptop at home, somehow I have to get back 
> > my sanity).
> 
> You mean in addition to posting?  I can stick a copy on my
> free.linux.hp.com http site.

Yes. Seems that many people want that.


* Re: [patch 00/12] NUMA: Memoryless node support V3
  2007-07-13 17:23       ` Christoph Lameter
@ 2007-07-13 19:22         ` Lee Schermerhorn
  2007-07-13 20:53         ` Lee Schermerhorn
  1 sibling, 0 replies; 48+ messages in thread
From: Lee Schermerhorn @ 2007-07-13 19:22 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nishanth Aravamudan, akpm, kxr, linux-mm, KAMEZAWA Hiroyuki

On Fri, 2007-07-13 at 10:23 -0700, Christoph Lameter wrote:
> On Fri, 13 Jul 2007, Lee Schermerhorn wrote:
> 
> > I'm up to my eyeballs right now, setting up a large system for testing
> > VM scalability with Oracle.  I hope to have time early next week to test
> > your patches.  In a mail exchange between you and Andrew, you mentioned
> > that your memoryless-node patches are atop your slab defrag?  Shall I
> > test them that way?  Or try to rebase against the then current -mm tree?
> 
> You can skip the slab defrag. I posted a rediffed patch in my response 
> to Andrew. Use that one.
> 

OK, I see the rebased patch 7/12 [SLUB support] in your response to
Andrew.  

Thanks,
Lee


* Re: [patch 00/12] NUMA: Memoryless node support V3
  2007-07-13 17:23       ` Christoph Lameter
  2007-07-13 19:22         ` Lee Schermerhorn
@ 2007-07-13 20:53         ` Lee Schermerhorn
  2007-07-13 21:34           ` Christoph Lameter
  2007-07-13 23:18           ` Nishanth Aravamudan
  1 sibling, 2 replies; 48+ messages in thread
From: Lee Schermerhorn @ 2007-07-13 20:53 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nishanth Aravamudan, akpm, kxr, linux-mm, KAMEZAWA Hiroyuki

Had a chance to build/boot the latest series with the updated #7, ...

Quite a few offsets and one reject in #7, but easy to resolve.

Boots OK.  Quick test of hugetlb allocation on my platform shows the old
behavior with huge pages doubling up on the node that the "memoryless"
one falls back on.  Guess this is expected until we get Nish's patch
atop this one.
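
For illustration, the doubling-up described above can be modeled in
user space.  This is only a hedged sketch, not code from mm/hugetlb.c:
sketch_fill_pool(), the has_memory[] and fallback[] arrays and the
skip_memoryless flag are all hypothetical names, and the round-robin
rotation over nodes is an assumption about how the pool gets filled.

```c
#include <assert.h>

#define SKETCH_MAX_NODES 4

/*
 * Distribute nr_pages huge pages round-robin over the nodes.
 * has_memory[n] says whether node n can satisfy an allocation itself;
 * fallback[n] is the node that serves requests aimed at a memoryless
 * node n.  If skip_memoryless is nonzero, memoryless nodes are left
 * out of the rotation instead of being redirected.
 */
static void sketch_fill_pool(int nr_pages, const int has_memory[],
			     const int fallback[], int skip_memoryless,
			     int pages_on_node[])
{
	int node = 0;
	int placed = 0;
	int any = 0;

	/* Without at least one node with memory the loop could not end. */
	for (int n = 0; n < SKETCH_MAX_NODES; n++)
		any |= has_memory[n];
	assert(any);

	while (placed < nr_pages) {
		int target = node;

		node = (node + 1) % SKETCH_MAX_NODES;
		if (!has_memory[target]) {
			if (skip_memoryless)
				continue;
			/* Redirect: the fallback node gets an extra page. */
			target = fallback[target];
		}
		assert(has_memory[target]);
		pages_on_node[target]++;
		placed++;
	}
}
```

With nodes 0, 1 and 3 holding memory, node 2 falling back to node 1,
and 8 pages requested, the redirecting variant puts 4 pages on node 1
and 2 each on nodes 0 and 3; skipping node 2 gives 3, 3 and 2.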

Next week I'll reconfig a platform fully interleaved which will result
in all of the real nodes appearing memoryless and do more testing.

Have a nice vacation.

Nish:

Shall I try to rebase your patches atop Christoph's in my tree?

The last ones I have are from 19jul:

	01-fix-hugetlb-pool-allocation-with-memoryless-nodes
	02-hugetlb-numafy-several-functions
	03-add-per-node-nr_hugepages-sysfs-attribute

Do you have more recent ones?

Lee



* Re: [patch 00/12] NUMA: Memoryless node support V3
  2007-07-13 20:53         ` Lee Schermerhorn
@ 2007-07-13 21:34           ` Christoph Lameter
  2007-07-13 23:18           ` Nishanth Aravamudan
  1 sibling, 0 replies; 48+ messages in thread
From: Christoph Lameter @ 2007-07-13 21:34 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Nishanth Aravamudan, akpm, kxr, linux-mm, KAMEZAWA Hiroyuki

On Fri, 13 Jul 2007, Lee Schermerhorn wrote:

> Next week I'll reconfig a platform fully interleaved which will result
> in all of the real nodes appearing memoryless and do more testing.

Thanks.

Keith Rich (kxr) is working on nodeless memory support (which will be 
important to our upcoming product line) now for SGI. He is a bit new to 
working with the community.  We probably need to support him a bit to get 
going. Please keep him cced on this.

> Have a nice vacation.

Thanks.


* Re: [patch 00/12] NUMA: Memoryless node support V3
  2007-07-13 20:53         ` Lee Schermerhorn
  2007-07-13 21:34           ` Christoph Lameter
@ 2007-07-13 23:18           ` Nishanth Aravamudan
  1 sibling, 0 replies; 48+ messages in thread
From: Nishanth Aravamudan @ 2007-07-13 23:18 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Christoph Lameter, akpm, kxr, linux-mm, KAMEZAWA Hiroyuki

On 13.07.2007 [16:53:52 -0400], Lee Schermerhorn wrote:
> Christoph:
> 
> Had a chance to build/boot the latest series with the updated #7, ...
> 
> Quite a few offsets and one reject in #7, but easy to resolve.
> 
> Boots OK.  Quick test of hugetlb allocation on my platform shows the
> old behavior with huge pages doubling up on the node that the
> "memoryless" one falls back on.  Guess this is expected until we get
> Nish's patch atop this one.

Yep, I've tested his stack as well (just got some results and it seems
ok).

> Next week I'll reconfig a platform fully interleaved which will result
> in all of the real nodes appearing memoryless and do more testing.
> 
> Have a nice vacation.
> 
> Nish:
> 
> Shall I try to rebase your patches atop Christoph's in my tree?

I've got all three of my patches rebased. I'll repost them shortly, am
just trying to verify they still work as expected on NUMA, NUMA w/
memoryless and non-NUMA, as before.

Thanks,
Nish

> The last ones I have are from 19jul:
> 
> 	01-fix-hugetlb-pool-allocation-with-memoryless-nodes

FWIW, you just need to

sed -i 's/node_memory_map/node_states[N_MEMORY]/g' mm/hugetlb.c

for this patch and

> 	02-hugetlb-numafy-several-functions
> 	03-add-per-node-nr_hugepages-sysfs-attribute

these two apply cleanly. Everything should build at that point, as well.

> Do you have more recent ones?

I'll make sure you're on the Cc once I repost.

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center


* [PATCH] Memoryless nodes:  use "node_memory_map" for cpuset mems_allowed validation
  2007-07-11 19:06   ` [patch 01/12] NUMA: Generic management of nodemasks for various purposes Christoph Lameter
  2007-07-11 19:32     ` Lee Schermerhorn
@ 2007-07-20 20:49     ` Lee Schermerhorn
  2007-07-20 22:07       ` Nishanth Aravamudan
  2007-07-23 19:09       ` Nishanth Aravamudan
  2007-07-24 14:15     ` [PATCH take2] " Lee Schermerhorn
                       ` (2 subsequent siblings)
  4 siblings, 2 replies; 48+ messages in thread
From: Lee Schermerhorn @ 2007-07-20 20:49 UTC (permalink / raw)
  To: Christoph Lameter, Paul Jackson
  Cc: akpm, kxr, linux-mm, Nishanth Aravamudan, KAMEZAWA Hiroyuki

This fixes a problem I encountered testing Christoph's memoryless nodes
series.  Applies atop that series.  Other than this, series holds up
under what testing I've been able to do this week.

Memoryless Nodes:  use "node_memory_map" for cpusets mems_allowed validation

cpusets try to ensure that any node added to a cpuset's 
mems_allowed is on-line and contains memory.  The assumption
was that online nodes contained memory.  Thus, it is possible
to add memoryless nodes to a cpuset and then add tasks to this
cpuset.  This results in a continuous series of oom-kill and other
console stack traces and an apparent system hang.

Change cpusets to use node_states[N_MEMORY] [a.k.a.
node_memory_map] in place of node_online_map when vetting 
memories.  Return error if admin attempts to write a non-empty
mems_allowed node mask containing only memoryless-nodes.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 kernel/cpuset.c |   33 +++++++++++++++++++++++----------
 1 file changed, 23 insertions(+), 10 deletions(-)

Index: Linux/kernel/cpuset.c
===================================================================
--- Linux.orig/kernel/cpuset.c	2007-07-20 16:02:01.000000000 -0400
+++ Linux/kernel/cpuset.c	2007-07-20 16:27:46.000000000 -0400
@@ -316,26 +316,26 @@ static void guarantee_online_cpus(const 
 
 /*
  * Return in *pmask the portion of a cpusets's mems_allowed that
- * are online.  If none are online, walk up the cpuset hierarchy
- * until we find one that does have some online mems.  If we get
- * all the way to the top and still haven't found any online mems,
- * return node_online_map.
+ * are online, with memory.  If none are online with memory, walk
+ * up the cpuset hierarchy until we find one that does have some
+ * online mems.  If we get all the way to the top and still haven't
+ * found any online mems, return node_states[N_MEMORY].
  *
  * One way or another, we guarantee to return some non-empty subset
- * of node_online_map.
+ * of node_states[N_MEMORY].
  *
  * Call with callback_mutex held.
  */
 
 static void guarantee_online_mems(const struct cpuset *cs, nodemask_t *pmask)
 {
-	while (cs && !nodes_intersects(cs->mems_allowed, node_online_map))
+	while (cs && !nodes_intersects(cs->mems_allowed, node_states[N_MEMORY]))
 		cs = cs->parent;
 	if (cs)
-		nodes_and(*pmask, cs->mems_allowed, node_online_map);
+		nodes_and(*pmask, cs->mems_allowed, node_states[N_MEMORY]);
 	else
-		*pmask = node_online_map;
-	BUG_ON(!nodes_intersects(*pmask, node_online_map));
+		*pmask = node_states[N_MEMORY];
+	BUG_ON(!nodes_intersects(*pmask, node_states[N_MEMORY]));
 }
 
 /**
@@ -623,8 +623,21 @@ static int update_nodemask(struct cpuset
 		retval = nodelist_parse(buf, trialcs.mems_allowed);
 		if (retval < 0)
 			goto done;
+		if (!nodes_intersects(trialcs.mems_allowed,
+						node_states[N_MEMORY])) {
+			/*
+			 * error if only memoryless nodes specified.
+			 */
+			retval = -ENOSPC;
+			goto done;
+		}
 	}
-	nodes_and(trialcs.mems_allowed, trialcs.mems_allowed, node_online_map);
+	/*
+	 * Exclude memoryless nodes.  We know that trialcs.mems_allowed
+	 * contains at least one node with memory.
+	 */
+	nodes_and(trialcs.mems_allowed, trialcs.mems_allowed,
+						node_states[N_MEMORY]);
 	oldmem = cs->mems_allowed;
 	if (nodes_equal(oldmem, trialcs.mems_allowed)) {
 		retval = 0;		/* Too easy - nothing to do */
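
The vetting rule that the update_nodemask() hunk adds can be modeled
outside the kernel.  The following is a hedged user-space sketch, not
part of the patch: sketch_nodemask_t is a hypothetical one-bit-per-node
stand-in for nodemask_t, sketch_vet_mems() is an invented name, and the
nodes_with_memory argument plays the role of node_states[N_MEMORY].

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for nodemask_t: one bit per node, up to 64 nodes. */
typedef uint64_t sketch_nodemask_t;

/*
 * Vet a requested mems_allowed mask against the nodes that have memory.
 * Returns 0 and stores the filtered mask in *out, or -ENOSPC (leaving
 * *out untouched) if the request is non-empty but names only
 * memoryless nodes.
 */
static int sketch_vet_mems(sketch_nodemask_t requested,
			   sketch_nodemask_t nodes_with_memory,
			   sketch_nodemask_t *out)
{
	assert(out != NULL);

	/* A non-empty request naming only memoryless nodes is an error. */
	if (requested != 0 && (requested & nodes_with_memory) == 0)
		return -ENOSPC;

	/* Otherwise silently drop any memoryless nodes from the request. */
	*out = requested & nodes_with_memory;
	return 0;
}
```

For example, with memory on nodes 0 and 1 only, a request for nodes 1
and 2 is trimmed to node 1, while a request for node 2 alone fails.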



* Re: [PATCH] Memoryless nodes:  use "node_memory_map" for cpuset mems_allowed validation
  2007-07-20 20:49     ` [PATCH] Memoryless nodes: use "node_memory_map" for cpuset mems_allowed validation Lee Schermerhorn
@ 2007-07-20 22:07       ` Nishanth Aravamudan
  2007-07-23 19:09       ` Nishanth Aravamudan
  1 sibling, 0 replies; 48+ messages in thread
From: Nishanth Aravamudan @ 2007-07-20 22:07 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Christoph Lameter, Paul Jackson, akpm, kxr, linux-mm, KAMEZAWA Hiroyuki

On 20.07.2007 [16:49:24 -0400], Lee Schermerhorn wrote:
> This fixes a problem I encountered testing Christoph's memoryless nodes
> series.  Applies atop that series.  Other than this, series holds up
> under what testing I've been able to do this week.
> 
> Memoryless Nodes:  use "node_memory_map" for cpusets mems_allowed validation
> 
> cpusets try to ensure that any node added to a cpuset's 
> mems_allowed is on-line and contains memory.  The assumption
> was that online nodes contained memory.  Thus, it is possible
> to add memoryless nodes to a cpuset and then add tasks to this
> cpuset.  This results in continuous series of oom-kill and other
> console stack traces and apparent system hang.
> 
> Change cpusets to use node_states[N_MEMORY] [a.k.a.
> node_memory_map] in place of node_online_map when vetting 
> memories.  Return error if admin attempts to write a non-empty
> mems_allowed node mask containing only memoryless-nodes.

Yep, in cursorily looking at the hugetlb pool growing with cpusets (more
specifically at cpuset.c), I was thinking this would be necessary.

> Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center


* Re: [PATCH] Memoryless nodes:  use "node_memory_map" for cpuset mems_allowed validation
  2007-07-20 20:49     ` [PATCH] Memoryless nodes: use "node_memory_map" for cpuset mems_allowed validation Lee Schermerhorn
  2007-07-20 22:07       ` Nishanth Aravamudan
@ 2007-07-23 19:09       ` Nishanth Aravamudan
  2007-07-23 19:23         ` Paul Jackson
  2007-07-23 20:59         ` Lee Schermerhorn
  1 sibling, 2 replies; 48+ messages in thread
From: Nishanth Aravamudan @ 2007-07-23 19:09 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Christoph Lameter, Paul Jackson, akpm, kxr, linux-mm, KAMEZAWA Hiroyuki

On 20.07.2007 [16:49:24 -0400], Lee Schermerhorn wrote:
> This fixes a problem I encountered testing Christoph's memoryless nodes
> series.  Applies atop that series.  Other than this, series holds up
> under what testing I've been able to do this week.
> 
> Memoryless Nodes:  use "node_memory_map" for cpusets mems_allowed validation
> 
> cpusets try to ensure that any node added to a cpuset's 
> mems_allowed is on-line and contains memory.  The assumption
> was that online nodes contained memory.  Thus, it is possible
> to add memoryless nodes to a cpuset and then add tasks to this
> cpuset.  This results in continuous series of oom-kill and other
> console stack traces and apparent system hang.
> 
> Change cpusets to use node_states[N_MEMORY] [a.k.a.
> node_memory_map] in place of node_online_map when vetting 
> memories.  Return error if admin attempts to write a non-empty
> mems_allowed node mask containing only memoryless-nodes.
> 
> Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

Lee, while looking at this change, I think it ends up fixing
cpuset_mems_allowed() to return nodemasks that only include nodes in
node_states[N_MEMORY]. However, cpuset_current_mems_allowed is a
lockless macro which would still be broken. I think it would need to
become a static inline nodes_and() in the CPUSET case and a #define
node_states[N_MEMORY] in the non-CPUSET case?
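
A rough user-space sketch of that idea, purely for illustration: every
sketch_* name below is hypothetical, the two globals stand in for
node_states[N_MEMORY] and current->mems_allowed, and the real macro in
the cpuset headers may well need to look different.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for nodemask_t: one bit per node. */
typedef uint64_t sketch_nodemask_t;

/* Set to 0 to model a kernel built without cpusets. */
#define SKETCH_CONFIG_CPUSETS 1

/* Stand-in for node_states[N_MEMORY]. */
static sketch_nodemask_t sketch_nodes_with_memory;

#if SKETCH_CONFIG_CPUSETS
/* Stand-in for current->mems_allowed. */
static sketch_nodemask_t sketch_task_mems_allowed;

/* CPUSET case: intersect the task's mask with the nodes that have memory. */
static inline sketch_nodemask_t sketch_current_mems_allowed(void)
{
	return sketch_task_mems_allowed & sketch_nodes_with_memory;
}
#else
/* Non-CPUSET case: every node with memory is allowed. */
#define sketch_current_mems_allowed() (sketch_nodes_with_memory)
#endif

/* The property a caller relies on: no memoryless node is in the mask. */
static inline void sketch_check_usable(sketch_nodemask_t mask)
{
	assert((mask & ~sketch_nodes_with_memory) == 0);
}
```

Either variant hands callers a mask that contains only nodes with
memory, at the cost of one extra AND per use in the CPUSET case.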

Or perhaps we should adjust cpusets to make it so that the mems_allowed
member only includes nodes that are set in node_states[N_MEMORY]?

What do you think? Paul?

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center


* Re: [PATCH] Memoryless nodes:  use "node_memory_map" for cpuset mems_allowed validation
  2007-07-23 19:09       ` Nishanth Aravamudan
@ 2007-07-23 19:23         ` Paul Jackson
  2007-07-23 20:08           ` Nishanth Aravamudan
  2007-07-23 20:59         ` Lee Schermerhorn
  1 sibling, 1 reply; 48+ messages in thread
From: Paul Jackson @ 2007-07-23 19:23 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Lee.Schermerhorn, clameter, akpm, kxr, linux-mm, kamezawa.hiroyu

> Or perhaps we should adjust cpusets to make it so that the mems_allowed
> member only includes nodes that are set in node_states[N_MEMORY]?
> 
> What do you think? Paul?

Do you mean the "mems_allowed" member of the task struct?

That might make sense - changing task->mems_allowed to just include nodes
with memory.

Someone would have to audit the entire kernel for uses of task->mems_allowed,
to see if all uses would be ok with this change.

I'm on vacation this week and next, so won't be doing that work right now.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [PATCH] Memoryless nodes:  use "node_memory_map" for cpuset mems_allowed validation
  2007-07-23 19:23         ` Paul Jackson
@ 2007-07-23 20:08           ` Nishanth Aravamudan
  0 siblings, 0 replies; 48+ messages in thread
From: Nishanth Aravamudan @ 2007-07-23 20:08 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Lee.Schermerhorn, clameter, akpm, kxr, linux-mm, kamezawa.hiroyu

On 23.07.2007 [12:23:33 -0700], Paul Jackson wrote:
> > Or perhaps we should adjust cpusets to make it so that the mems_allowed
> > member only includes nodes that are set in node_states[N_MEMORY]?
> > 
> > What do you think? Paul?
> 
> Do you mean the "mems_allowed" member of the task struct?

I guess both that of the task_struct and that of the cpuset? I'm not
sure. Could we do it for both?

> That might make sense - changing task->mems_allowed to just include
> nodes with memory.

Yep.

> Someone would have to audit the entire kernel for uses of
> task->mems_allowed, to see if all uses would be ok with this change.

I am starting that now -- I'm first looking at every place (in -mm,
admittedly) that mems_allowed is assigned. Since now it's possible that
we'll have to do extra checking if some sort of rebinding to memoryless
nodes would occur (which we currently wouldn't even notice, AFAICT).

> I'm on vacation this week and next, so won't be doing that work right
> now.

Ok, thanks for taking the time to reply! I will try and spin something
up for you to review when you're back from vacation.

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center


* Re: [PATCH] Memoryless nodes:  use "node_memory_map" for cpuset mems_allowed validation
  2007-07-23 19:09       ` Nishanth Aravamudan
  2007-07-23 19:23         ` Paul Jackson
@ 2007-07-23 20:59         ` Lee Schermerhorn
  2007-07-23 21:48           ` Nishanth Aravamudan
  1 sibling, 1 reply; 48+ messages in thread
From: Lee Schermerhorn @ 2007-07-23 20:59 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Christoph Lameter, Paul Jackson, akpm, kxr, linux-mm, KAMEZAWA Hiroyuki

On Mon, 2007-07-23 at 12:09 -0700, Nishanth Aravamudan wrote: 
> On 20.07.2007 [16:49:24 -0400], Lee Schermerhorn wrote:
> > This fixes a problem I encountered testing Christoph's memoryless nodes
> > series.  Applies atop that series.  Other than this, series holds up
> > under what testing I've been able to do this week.
> > 
> > Memoryless Nodes:  use "node_memory_map" for cpusets mems_allowed validation
> > 
> > cpusets try to ensure that any node added to a cpuset's 
> > mems_allowed is on-line and contains memory.  The assumption
> > was that online nodes contained memory.  Thus, it is possible
> > to add memoryless nodes to a cpuset and then add tasks to this
> > cpuset.  This results in continuous series of oom-kill and other
> > console stack traces and apparent system hang.
> > 
> > Change cpusets to use node_states[N_MEMORY] [a.k.a.
> > node_memory_map] in place of node_online_map when vetting 
> > memories.  Return error if admin attempts to write a non-empty
> > mems_allowed node mask containing only memoryless-nodes.
> > 
> > Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
> 
> Lee, while looking at this change, I think it ends up fixing
> cpuset_mems_allowed() to return nodemasks that only include nodes in
> node_states[N_MEMORY]. However, cpuset_current_mems_allowed is a
> lockless macro which would still be broken. I think it would need to
> become a static inline nodes_and() in the CPUSET case and a #define
> node_states[N_MEMORY] in the non-CPUSET case?
> 
> Or perhaps we should adjust cpusets to make it so that the mems_allowed
> member only includes nodes that are set in node_states[N_MEMORY]?


I thought that's what my patch to nodelist_parse() did.  It ensures that
current->mems_allowed is correct [contains at least one node with
memory, and only nodes with memory] at the time it is installed, but
doesn't consider memory hot plug and node off-lining.  Is this
[offline/hotplug] your point?

Seems like that is an issue that exists in the unpatched code as
well--i.e., unlike cpuset_mems_allowed(), the lockless, "_current_"
version does not vet current->mems_allowed against the
node_online_map.  So, all valid nodes in current->mems_allowed could
have been off-lined since the mask was installed.  Am I reading this
right?

Lee


* Re: [PATCH] Memoryless nodes:  use "node_memory_map" for cpuset mems_allowed validation
  2007-07-23 20:59         ` Lee Schermerhorn
@ 2007-07-23 21:48           ` Nishanth Aravamudan
  2007-07-24 14:11             ` Lee Schermerhorn
  0 siblings, 1 reply; 48+ messages in thread
From: Nishanth Aravamudan @ 2007-07-23 21:48 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Christoph Lameter, Paul Jackson, akpm, kxr, linux-mm, KAMEZAWA Hiroyuki

On 23.07.2007 [16:59:52 -0400], Lee Schermerhorn wrote:
> On Mon, 2007-07-23 at 12:09 -0700, Nishanth Aravamudan wrote: 
> > On 20.07.2007 [16:49:24 -0400], Lee Schermerhorn wrote:
> > > This fixes a problem I encountered testing Christoph's memoryless nodes
> > > series.  Applies atop that series.  Other than this, series holds up
> > > under what testing I've been able to do this week.
> > > 
> > > Memoryless Nodes:  use "node_memory_map" for cpusets mems_allowed validation
> > > 
> > > cpusets try to ensure that any node added to a cpuset's 
> > > mems_allowed is on-line and contains memory.  The assumption
> > > was that online nodes contained memory.  Thus, it is possible
> > > to add memoryless nodes to a cpuset and then add tasks to this
> > > cpuset.  This results in continuous series of oom-kill and other
> > > console stack traces and apparent system hang.
> > > 
> > > Change cpusets to use node_states[N_MEMORY] [a.k.a.
> > > node_memory_map] in place of node_online_map when vetting 
> > > memories.  Return error if admin attempts to write a non-empty
> > > mems_allowed node mask containing only memoryless-nodes.
> > > 
> > > Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
> > 
> > Lee, while looking at this change, I think it ends up fixing
> > cpuset_mems_allowed() to return nodemasks that only include nodes in
> > node_states[N_MEMORY]. However, cpuset_current_mems_allowed is a
> > lockless macro which would still be broken. I think it would need to
> > become a static inline nodes_and() in the CPUSET case and a #define
> > node_states[N_MEMORY] in the non-CPUSET case?
> > 
> > Or perhaps we should adjust cpusets to make it so that the mems_allowed
> > member only includes nodes that are set in node_states[N_MEMORY]?
> 
> 
> I thought that's what my patch to nodelist_parse() did.  It ensures that
> current->mems_allowed is correct [contains at least one node with
> memory, and only nodes with memory] at the time it is installed, but
> doesn't consider memory hot plug and node off-lining.  Is this
> [offline/hotplug] your point?

And every time it is updated, right? (current->mems_allowed). My concern
is purely whether I can then directly use cpuset_current_mems_allowed in
the interleave code for hugetlb.c and it will do the right thing. It
will work if the #define is changed for !CPUSETS and if your change
guarantees current->mems_allowed is always consistent with
node_states[N_MEMORY].

I think I simply was confused about the full impact of your changes, as
I don't know cpusets that well. I'm going to try and test a memoryless
node box I have at work w/ your change, though, and see what happens.

> Seems like that is an issue that exists in the unpatched code as
> well--i.e., unlike cpuset_mems_allowed(), the lockless, "_current_"
> version does not vet current->mems_allowed against the
> node_online_map.  So, all valid nodes in current->mems_allowed could
> have been off-lined since the mask was installed.  Am I reading this
> right?

True -- I honestly don't know. I doubt much of this code has been fully
audited for full node unplug?

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center


* Re: [PATCH] Memoryless nodes:  use "node_memory_map" for cpuset mems_allowed validation
  2007-07-23 21:48           ` Nishanth Aravamudan
@ 2007-07-24 14:11             ` Lee Schermerhorn
  2007-07-24 16:16               ` Nishanth Aravamudan
  0 siblings, 1 reply; 48+ messages in thread
From: Lee Schermerhorn @ 2007-07-24 14:11 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Christoph Lameter, Paul Jackson, akpm, kxr, linux-mm, KAMEZAWA Hiroyuki

On Mon, 2007-07-23 at 14:48 -0700, Nishanth Aravamudan wrote:
> On 23.07.2007 [16:59:52 -0400], Lee Schermerhorn wrote:
> > On Mon, 2007-07-23 at 12:09 -0700, Nishanth Aravamudan wrote: 
> > > On 20.07.2007 [16:49:24 -0400], Lee Schermerhorn wrote:
> > > > This fixes a problem I encountered testing Christoph's memoryless nodes
> > > > series.  Applies atop that series.  Other than this, series holds up
> > > > under what testing I've been able to do this week.
> > > > 
> > > > Memoryless Nodes:  use "node_memory_map" for cpusets mems_allowed validation
> > > > 
> > > > cpusets try to ensure that any node added to a cpuset's 
> > > > mems_allowed is on-line and contains memory.  The assumption
> > > > was that online nodes contained memory.  Thus, it is possible
> > > > to add memoryless nodes to a cpuset and then add tasks to this
> > > > cpuset.  This results in continuous series of oom-kill and other
> > > > console stack traces and apparent system hang.
> > > > 
> > > > Change cpusets to use node_states[N_MEMORY] [a.k.a.
> > > > node_memory_map] in place of node_online_map when vetting 
> > > > memories.  Return error if admin attempts to write a non-empty
> > > > mems_allowed node mask containing only memoryless-nodes.
> > > > 
> > > > Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
> > > 
> > > Lee, while looking at this change, I think it ends up fixing
> > > cpuset_mems_allowed() to return nodemasks that only include nodes in
> > > node_states[N_MEMORY]. However, cpuset_current_mems_allowed is a
> > > lockless macro which would still be broken. I think it would need to
> > > become a static inline nodes_and() in the CPUSET case and a #define
> > > node_states[N_MEMORY] in the non-CPUSET case?
> > > 
> > > Or perhaps we should adjust cpusets to make it so that the mems_allowed
> > > member only includes nodes that are set in node_states[N_MEMORY]?
> > 
> > 
> > I thought that's what my patch to nodelist_parse() did.  It ensures that
> > current->mems_allowed is correct [contains at least one node with
> > memory, and only nodes with memory] at the time it is installed, but
> > doesn't consider memory hot plug and node off-lining.  Is this
> > [offline/hotplug] your point?
> 
> And every time it is updated, right? (current->mems_allowed). My concern
> is purely whether I can then directly use cpuset_current_mems_allowed in
> the interleave code for hugetlb.c and it will do the right thing. It
> will work if the #define is changed for !CPUSETS and if your change
> guarantees current->mems_allowed is always consistent with
> node_states[N_MEMORY].

Other than offlining/hot removal of memory, I think the only place that
current->mems_allowed gets updated is in update_nodemask() [I wrote
nodelist_parse() previously by mistake].  My patch to that function
tries to ensure that current->mems_allowed always contains at least one
node with memory.

If by "gets updated" you're referring to
cpuset_update_task_memory_state(), the latter calls
guarantee_online_mems(), which I also patched to use
node_states[N_MEMORY] instead of node_online_map.  So, I think you can
use current->mems_allowed in the hugetlb code.  Maybe call
cpuset_update_task_memory_state() before using it?  However, I think
that will have the effect of escaping the cpuset constraints if all of
the nodes in the current task's mems_allowed have been offlined or hot
removed since this mask was created/updated in update_nodemask().
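
The fallback behaviour described above can be sketched in userspace C.
This is only an illustration, not the kernel code: a 64-bit word stands
in for nodemask_t, and struct cpuset_sketch and guarantee_mems_sketch()
are hypothetical names.  It shows how a task can end up with nodes from
a parent cpuset when none of its own nodes has memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in: one bit per node, in place of nodemask_t. */
typedef uint64_t nodemask;

struct cpuset_sketch {
	nodemask mems_allowed;
	const struct cpuset_sketch *parent;	/* NULL for the top cpuset */
};

/*
 * Return the part of cs->mems_allowed that has memory.  If that part
 * is empty, walk up the hierarchy until some ancestor intersects
 * nodes_with_memory.  If no ancestor does, fall back to every node
 * with memory.  The caller must pass a non-empty nodes_with_memory.
 */
static nodemask guarantee_mems_sketch(const struct cpuset_sketch *cs,
				      nodemask nodes_with_memory)
{
	nodemask result;

	while (cs && !(cs->mems_allowed & nodes_with_memory))
		cs = cs->parent;

	result = cs ? (cs->mems_allowed & nodes_with_memory)
		    : nodes_with_memory;

	/* Plays the role of the BUG_ON() in the kernel function. */
	assert(result & nodes_with_memory);
	return result;
}
```

With a child cpuset holding only nodes 0-1 and memory only on node 4,
the sketch returns node 4 from the parent, i.e. the result lies outside
the child's own mems_allowed.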

> 
> I think I simply was confused about the full impact of your changes, as
> I don't know cpusets that well. I'm going to try and test a memoryless
> node box I have at work w/ your change, though, and see what happens.

FYI:  I initially tried to test Christoph's memless nodes series with
your rebased hugetlb patches, but the system appeared to hang.  [Might
be related to Ken Chen's recent hugetlb patch?]  I backed off to just
Christoph's series and things seem to run OK.  That's when I noticed
that one could create a cpuset with just memoryless nodes and posted the
subject patch.  I'll get back to testing your patches on my memoryless
nodes system "real soon now".

Meanwhile, as you've pointed out, I missed the "node_online_map" usage
in the header and, I see, in the initialization of the top level cpuset
in cpuset_init_smp().  I'm testing this now.  I'll repost the patch with
these fixes shortly.

For completeness, here's the numactl --hardware output [less the SLIT
info] from my test platform [ia64] in its current config:

available: 5 nodes (0-4)
node 0 size: 0 MB
node 0 free: 0 MB
node 1 size: 0 MB
node 1 free: 0 MB
node 2 size: 0 MB
node 2 free: 0 MB
node 3 size: 0 MB
node 3 free: 0 MB
node 4 size: 8191 MB
node 4 free: 105 MB

Booted with mem=8G to ensure swapping, ...  Free mem is so low because
of the tests I'm running.  It varies between ~40M and ~150M.

> 
> > Seems like that is an issue that exists in the unpatched code as
> > well--i.e., unlike cpuset_mems_allowed(), the lockless, "_current_"
> > version does not vet current->mems_allowed against the
> > node_online_map.  So, all valid nodes in current->mems_allowed could
> > have been off-lined since the mask was installed.  Am I reading this
> > right?
> 
> True -- I honestly don't know. I doubt much of this code has been fully
> audited for full node unplug?

Looks like at least an initial stab has been made...


Lee


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH take2] Memoryless nodes:  use "node_memory_map" for cpuset mems_allowed validation
  2007-07-11 19:06   ` [patch 01/12] NUMA: Generic management of nodemasks for various purposes Christoph Lameter
  2007-07-11 19:32     ` Lee Schermerhorn
  2007-07-20 20:49     ` [PATCH] Memoryless nodes: use "node_memory_map" for cpuset mems_allowed validation Lee Schermerhorn
@ 2007-07-24 14:15     ` Lee Schermerhorn
  2007-07-24 16:19       ` Nishanth Aravamudan
  2007-07-24 20:30     ` [PATCH take3] " Lee Schermerhorn
  2007-07-24 20:35     ` [PATCH/RFC] Memoryless nodes: Suppress redundant "node with no memory" messages Lee Schermerhorn
  4 siblings, 1 reply; 48+ messages in thread
From: Lee Schermerhorn @ 2007-07-24 14:15 UTC (permalink / raw)
  To: Christoph Lameter, Paul Jackson
  Cc: akpm, kxr, linux-mm, Nishanth Aravamudan, KAMEZAWA Hiroyuki

Memoryless Nodes:  use "node_memory_map" for cpusets - take 2

Against 2.6.22-rc6-mm1 atop Christoph Lameter's memoryless nodes
series

take 2:
+ replaced node_online_map in cpuset_current_mems_allowed()
  with node_states[N_MEMORY]
+ replaced node_online_map in cpuset_init_smp() with
  node_states[N_MEMORY]

cpusets try to ensure that any node added to a cpuset's
mems_allowed is on-line and contains memory, but the check
assumed that every online node contains memory.  Thus, it is
possible to add memoryless nodes to a cpuset and then add tasks
to this cpuset.  This results in a continuous series of oom-kills
and an apparent system hang.

Change cpusets to use node_states[N_MEMORY] [a.k.a.
node_memory_map] in place of node_online_map when vetting 
memories.  Return error if admin attempts to write a non-empty
mems_allowed node mask containing only memoryless-nodes.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/cpuset.h |    2 +-
 kernel/cpuset.c        |   35 ++++++++++++++++++++++++-----------
 2 files changed, 25 insertions(+), 12 deletions(-)

Index: Linux/kernel/cpuset.c
===================================================================
--- Linux.orig/kernel/cpuset.c	2007-07-20 16:02:01.000000000 -0400
+++ Linux/kernel/cpuset.c	2007-07-24 08:38:38.000000000 -0400
@@ -316,26 +316,26 @@ static void guarantee_online_cpus(const 
 
 /*
  * Return in *pmask the portion of a cpusets's mems_allowed that
- * are online.  If none are online, walk up the cpuset hierarchy
- * until we find one that does have some online mems.  If we get
- * all the way to the top and still haven't found any online mems,
- * return node_online_map.
+ * are online, with memory.  If none are online with memory, walk
+ * up the cpuset hierarchy until we find one that does have some
+ * online mems.  If we get all the way to the top and still haven't
+ * found any online mems, return node_states[N_MEMORY].
  *
  * One way or another, we guarantee to return some non-empty subset
- * of node_online_map.
+ * of node_states[N_MEMORY].
  *
  * Call with callback_mutex held.
  */
 
 static void guarantee_online_mems(const struct cpuset *cs, nodemask_t *pmask)
 {
-	while (cs && !nodes_intersects(cs->mems_allowed, node_online_map))
+	while (cs && !nodes_intersects(cs->mems_allowed, node_states[N_MEMORY]))
 		cs = cs->parent;
 	if (cs)
-		nodes_and(*pmask, cs->mems_allowed, node_online_map);
+		nodes_and(*pmask, cs->mems_allowed, node_states[N_MEMORY]);
 	else
-		*pmask = node_online_map;
-	BUG_ON(!nodes_intersects(*pmask, node_online_map));
+		*pmask = node_states[N_MEMORY];
+	BUG_ON(!nodes_intersects(*pmask, node_states[N_MEMORY]));
 }
 
 /**
@@ -623,8 +623,21 @@ static int update_nodemask(struct cpuset
 		retval = nodelist_parse(buf, trialcs.mems_allowed);
 		if (retval < 0)
 			goto done;
+		if (!nodes_intersects(trialcs.mems_allowed,
+						node_states[N_MEMORY])) {
+			/*
+			 * error if only memoryless nodes specified.
+			 */
+			retval = -ENOSPC;
+			goto done;
+		}
 	}
-	nodes_and(trialcs.mems_allowed, trialcs.mems_allowed, node_online_map);
+	/*
+	 * Exclude memoryless nodes.  We know that trialcs.mems_allowed
+	 * contains at least one node with memory.
+	 */
+	nodes_and(trialcs.mems_allowed, trialcs.mems_allowed,
+						node_states[N_MEMORY]);
 	oldmem = cs->mems_allowed;
 	if (nodes_equal(oldmem, trialcs.mems_allowed)) {
 		retval = 0;		/* Too easy - nothing to do */
@@ -1432,7 +1445,7 @@ void cpuset_track_online_nodes(void)
 void __init cpuset_init_smp(void)
 {
 	top_cpuset.cpus_allowed = cpu_online_map;
-	top_cpuset.mems_allowed = node_online_map;
+	top_cpuset.mems_allowed = node_states[N_MEMORY];
 
 	hotcpu_notifier(cpuset_handle_cpuhp, 0);
 }
Index: Linux/include/linux/cpuset.h
===================================================================
--- Linux.orig/include/linux/cpuset.h	2007-07-13 15:41:45.000000000 -0400
+++ Linux/include/linux/cpuset.h	2007-07-24 08:37:50.000000000 -0400
@@ -92,7 +92,7 @@ static inline nodemask_t cpuset_mems_all
 	return node_possible_map;
 }
 
-#define cpuset_current_mems_allowed (node_online_map)
+#define cpuset_current_mems_allowed (node_states[N_MEMORY])
 static inline void cpuset_init_current_mems_allowed(void) {}
 static inline void cpuset_update_task_memory_state(void) {}
 #define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
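
The validation added to update_nodemask() in the patch above can be
sketched in userspace C.  This is a minimal illustration under stated
assumptions: a 64-bit word stands in for nodemask_t, filter_mems_sketch()
is a hypothetical name, and an empty request is assumed to pass through
unchanged, since the changelog only rejects a non-empty mask made up
entirely of memoryless nodes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in: one bit per node, in place of nodemask_t. */
typedef uint64_t nodemask;

/*
 * Reject a non-empty request that names only memoryless nodes;
 * otherwise store the request with memoryless nodes stripped.
 * Returns 0 on success or -ENOSPC on rejection, leaving *out
 * untouched in the error case.
 */
static int filter_mems_sketch(nodemask requested,
			      nodemask nodes_with_memory,
			      nodemask *out)
{
	assert(out != NULL);

	if (requested != 0 && !(requested & nodes_with_memory))
		return -ENOSPC;

	*out = requested & nodes_with_memory;
	return 0;
}
```

On the test platform described earlier in the thread (memory only on
node 4), a request for nodes 0-3 would be refused, while a request for
nodes 3-4 would be accepted and reduced to node 4.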



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] Memoryless nodes:  use "node_memory_map" for cpuset mems_allowed validation
  2007-07-24 14:11             ` Lee Schermerhorn
@ 2007-07-24 16:16               ` Nishanth Aravamudan
  0 siblings, 0 replies; 48+ messages in thread
From: Nishanth Aravamudan @ 2007-07-24 16:16 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Christoph Lameter, Paul Jackson, akpm, kxr, linux-mm, KAMEZAWA Hiroyuki

On 24.07.2007 [10:11:03 -0400], Lee Schermerhorn wrote:
> On Mon, 2007-07-23 at 14:48 -0700, Nishanth Aravamudan wrote:
> > On 23.07.2007 [16:59:52 -0400], Lee Schermerhorn wrote:
> > > On Mon, 2007-07-23 at 12:09 -0700, Nishanth Aravamudan wrote: 
> > > > On 20.07.2007 [16:49:24 -0400], Lee Schermerhorn wrote:
> > > > > This fixes a problem I encountered testing Christoph's memoryless nodes
> > > > > series.  Applies atop that series.  Other than this, series holds up
> > > > > under what testing I've been able to do this week.
> > > > > 
> > > > > Memoryless Nodes:  use "node_memory_map" for cpusets mems_allowed validation
> > > > > 
> > > > > cpusets try to ensure that any node added to a cpuset's 
> > > > > mems_allowed is on-line and contains memory.  The assumption
> > > > > was that online nodes contained memory.  Thus, it is possible
> > > > > to add memoryless nodes to a cpuset and then add tasks to this
> > > > > cpuset.  This results in continuous series of oom-kill and other
> > > > > console stack traces and apparent system hang.
> > > > > 
> > > > > Change cpusets to use node_states[N_MEMORY] [a.k.a.
> > > > > node_memory_map] in place of node_online_map when vetting 
> > > > > memories.  Return error if admin attempts to write a non-empty
> > > > > mems_allowed node mask containing only memoryless-nodes.
> > > > > 
> > > > > Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
> > > > 
> > > > Lee, while looking at this change, I think it ends up fixing
> > > > cpuset_mems_allowed() to return nodemasks that only include nodes in
> > > > node_states[N_MEMORY]. However, cpuset_current_mems_allowed is a
> > > > lockless macro which would still be broken. I think it would need to
> > > > > become a static inline nodes_and() in the CPUSET case and a #define
> > > > node_states[N_MEMORY] in the non-CPUSET case?
> > > > 
> > > > Or perhaps we should adjust cpusets to make it so that the mems_allowed
> > > > member only includes nodes that are set in node_states[N_MEMORY]?
> > > 
> > > 
> > > I thought that's what my patch to nodelist_parse() did.  It ensures that
> > > current->mems_allowed is correct [contains at least one node with
> > > memory, and only nodes with memory] at the time it is installed, but
> > > doesn't consider memory hot plug and node off-lining.  Is this
> > > [offline/hotplug] your point?
> > 
> > > And every time it is updated, right? (current->mems_allowed). My concern
> > > is purely whether I can then directly use cpuset_current_mems_allowed in
> > > the interleave code for hugetlb.c and it will do the right thing. It
> > > will work if the #define is changed for !CPUSETS and if your change
> > > guarantees current->mems_allowed is always consistent with
> > > node_states[N_MEMORY].
> 
> Other than offlining/hot removal of memory, I think the only place
> that current->mems_allowed gets updated is in update_nodemask() [I
> wrote nodelist_parse() previously by mistake].  My patch to that
> function tries to ensure that current->mems_allowed always contains at
> least one node with memory.

Ok.

> If by "gets updated" you're referring to
> cpuset_update_task_memory_state(), the latter calls
> guarantee_online_mems(), which I also patched to use
> node_states[N_MEMORY] instead of node_online_map.  So, I think you
> can use current->mems_allowed in the hugetlb code.  Maybe call
> cpuset_update_task_memory_state() before using it?  However, I think
> that will have the effect of escaping the cpuset constraints if all of
> the nodes in the current task's mems_allowed have been offlined or hot
> removed since this mask was created/updated in update_nodemask().

Ok, well, I'll test on a memoryless configuration here and see if just
using cpuset_current_mems_allowed is sufficient.

> > I think I simply was confused about the full impact of your changes,
> > as I don't know cpusets that well. I'm going to try and test a
> > memoryless node box I have at work w/ your change, though, and see
> > what happens.
> 
> FYI:  I initially tried to test Christoph's memless nodes series with
> your rebased hugetlb patches, but the system appeared to hang.  [Might
> be related to Ken Chen's recent hugetlb patch?]

Were you running all of the patches on top of -mm? That's what I've been
testing as well.

> I backed off to just Christoph's series and things seem to run OK.
> That's when I noticed that one could create a cpuset with just
> memoryless nodes and posted the subject patch.  I'll get back to
> testing your patches on my memoryless nodes system "real soon now".

Cool, please keep me posted.

> Meanwhile, as you've pointed out, I missed the "node_online_map" usage
> in the header and, I see, in the initialization of the top level
> cpuset in cpuset_init_smp().  I'm testing this now.  I'll repost the
> patch with these fixes shortly.

Cool, I see it in my inbox.

> For completeness, here's the numactl --hardware output [less the SLIT
> info] from my test platform [ia64] in its current config:

Good info to have. A very restricted configuration :)

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH take2] Memoryless nodes:  use "node_memory_map" for cpuset mems_allowed validation
  2007-07-24 14:15     ` [PATCH take2] " Lee Schermerhorn
@ 2007-07-24 16:19       ` Nishanth Aravamudan
  2007-07-24 19:01         ` Lee Schermerhorn
  0 siblings, 1 reply; 48+ messages in thread
From: Nishanth Aravamudan @ 2007-07-24 16:19 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Christoph Lameter, Paul Jackson, akpm, kxr, linux-mm, KAMEZAWA Hiroyuki

On 24.07.2007 [10:15:25 -0400], Lee Schermerhorn wrote:
> Memoryless Nodes:  use "node_memory_map" for cpusets - take 2
> 
> Against 2.6.22-rc6-mm1 atop Christoph Lameter's memoryless nodes
> series
> 
> take 2:
> + replaced node_online_map in cpuset_current_mems_allowed()
>   with node_states[N_MEMORY]
> + replaced node_online_map in cpuset_init_smp() with
>   node_states[N_MEMORY]
> 
> cpusets try to ensure that any node added to a cpuset's 
> mems_allowed is on-line and contains memory.  The assumption
> was that online nodes contained memory.  Thus, it is possible
> to add memoryless nodes to a cpuset and then add tasks to this
> cpuset.  This results in continuous series of oom-kill and
> apparent system hang.
> 
> Change cpusets to use node_states[N_MEMORY] [a.k.a.
> node_memory_map] in place of node_online_map when vetting 
> memories.  Return error if admin attempts to write a non-empty
> mems_allowed node mask containing only memoryless-nodes.

I think you still are missing a few comment changes (anything mentioning
'track'ing node_online_map will need to be changed, I think). Also, I
don't see the necessary change in common_cpu_mem_hotplug_unplug()
similar to cpuset_init_smp()'s change.

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH take2] Memoryless nodes:  use "node_memory_map" for cpuset mems_allowed validation
  2007-07-24 16:19       ` Nishanth Aravamudan
@ 2007-07-24 19:01         ` Lee Schermerhorn
  2007-07-25 15:50           ` Nishanth Aravamudan
  0 siblings, 1 reply; 48+ messages in thread
From: Lee Schermerhorn @ 2007-07-24 19:01 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Christoph Lameter, Paul Jackson, akpm, kxr, linux-mm, KAMEZAWA Hiroyuki

On Tue, 2007-07-24 at 09:19 -0700, Nishanth Aravamudan wrote:
> On 24.07.2007 [10:15:25 -0400], Lee Schermerhorn wrote:
> > Memoryless Nodes:  use "node_memory_map" for cpusets - take 2
> > 
> > Against 2.6.22-rc6-mm1 atop Christoph Lameter's memoryless nodes
> > series
> > 
> > take 2:
> > + replaced node_online_map in cpuset_current_mems_allowed()
> >   with node_states[N_MEMORY]
> > + replaced node_online_map in cpuset_init_smp() with
> >   node_states[N_MEMORY]
> > 
> > cpusets try to ensure that any node added to a cpuset's 
> > mems_allowed is on-line and contains memory.  The assumption
> > was that online nodes contained memory.  Thus, it is possible
> > to add memoryless nodes to a cpuset and then add tasks to this
> > cpuset.  This results in continuous series of oom-kill and
> > apparent system hang.
> > 
> > Change cpusets to use node_states[N_MEMORY] [a.k.a.
> > node_memory_map] in place of node_online_map when vetting 
> > memories.  Return error if admin attempts to write a non-empty
> > mems_allowed node mask containing only memoryless-nodes.
> 
> I think you still are missing a few comment changes (anything mentioning
> 'track'ing node_online_map will need to be changed, I think). Also, I
> don't see the necessary change in common_cpu_mem_hotplug_unplug()
> similar to cpuset_init_smp()'s change.

Sorry.  Multitasking meltdown...  Will fix.

Meanwhile:  

I've tested your 3 patches atop Christoph's series [on 22-rc6-mm1], with
and without my cpuset patch and I can't reproduce the hang I saw a
couple of days ago :-(.  I hate it when that happens!  Perhaps some
system daemon started up during the test that hung.

Later,
Lee



^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH take3] Memoryless nodes:  use "node_memory_map" for cpuset mems_allowed validation
  2007-07-11 19:06   ` [patch 01/12] NUMA: Generic management of nodemasks for various purposes Christoph Lameter
                       ` (2 preceding siblings ...)
  2007-07-24 14:15     ` [PATCH take2] " Lee Schermerhorn
@ 2007-07-24 20:30     ` Lee Schermerhorn
  2007-07-25 15:53       ` Nishanth Aravamudan
                         ` (2 more replies)
  2007-07-24 20:35     ` [PATCH/RFC] Memoryless nodes: Suppress redundant "node with no memory" messages Lee Schermerhorn
  4 siblings, 3 replies; 48+ messages in thread
From: Lee Schermerhorn @ 2007-07-24 20:30 UTC (permalink / raw)
  To: Christoph Lameter, Paul Jackson, Nishanth Aravamudan
  Cc: akpm, kxr, linux-mm, KAMEZAWA Hiroyuki

Memoryless Nodes:  use "node_memory_map" for cpusets - take 3

Against 2.6.22-rc6-mm1 atop Christoph Lameter's memoryless nodes
series

take 2:
+ replaced node_online_map in cpuset_current_mems_allowed()
  with node_states[N_MEMORY]
+ replaced node_online_map in cpuset_init_smp() with
  node_states[N_MEMORY]

take 3:
+ fix up comments and top level cpuset tracking of nodes
  with memory [instead of on-line nodes].
+ maybe I got them all this time?

cpusets try to ensure that any node added to a cpuset's
mems_allowed is on-line and contains memory, but the check
assumed that every online node contains memory.  Thus, it is
possible to add memoryless nodes to a cpuset and then add tasks
to this cpuset.  This results in a continuous series of oom-kills
and an apparent system hang.

Change cpusets to use node_states[N_MEMORY] [a.k.a.
node_memory_map] in place of node_online_map when vetting 
memories.  Return error if admin attempts to write a non-empty
mems_allowed node mask containing only memoryless-nodes.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/cpuset.h |    2 -
 kernel/cpuset.c        |   51 +++++++++++++++++++++++++++++++------------------
 2 files changed, 34 insertions(+), 19 deletions(-)

Index: Linux/kernel/cpuset.c
===================================================================
--- Linux.orig/kernel/cpuset.c	2007-07-24 11:24:56.000000000 -0400
+++ Linux/kernel/cpuset.c	2007-07-24 12:20:40.000000000 -0400
@@ -316,26 +316,26 @@ static void guarantee_online_cpus(const 
 
 /*
  * Return in *pmask the portion of a cpusets's mems_allowed that
- * are online.  If none are online, walk up the cpuset hierarchy
- * until we find one that does have some online mems.  If we get
- * all the way to the top and still haven't found any online mems,
- * return node_online_map.
+ * are online, with memory.  If none are online with memory, walk
+ * up the cpuset hierarchy until we find one that does have some
+ * online mems.  If we get all the way to the top and still haven't
+ * found any online mems, return node_states[N_MEMORY].
  *
  * One way or another, we guarantee to return some non-empty subset
- * of node_online_map.
+ * of node_states[N_MEMORY].
  *
  * Call with callback_mutex held.
  */
 
 static void guarantee_online_mems(const struct cpuset *cs, nodemask_t *pmask)
 {
-	while (cs && !nodes_intersects(cs->mems_allowed, node_online_map))
+	while (cs && !nodes_intersects(cs->mems_allowed, node_states[N_MEMORY]))
 		cs = cs->parent;
 	if (cs)
-		nodes_and(*pmask, cs->mems_allowed, node_online_map);
+		nodes_and(*pmask, cs->mems_allowed, node_states[N_MEMORY]);
 	else
-		*pmask = node_online_map;
-	BUG_ON(!nodes_intersects(*pmask, node_online_map));
+		*pmask = node_states[N_MEMORY];
+	BUG_ON(!nodes_intersects(*pmask, node_states[N_MEMORY]));
 }
 
 /**
@@ -606,7 +606,7 @@ static int update_nodemask(struct cpuset
 	int retval;
 	struct container_iter it;
 
-	/* top_cpuset.mems_allowed tracks node_online_map; it's read-only */
+	/* top_cpuset.mems_allowed tracks node_states[N_MEMORY]; it's read-only */
 	if (cs == &top_cpuset)
 		return -EACCES;
 
@@ -623,8 +623,21 @@ static int update_nodemask(struct cpuset
 		retval = nodelist_parse(buf, trialcs.mems_allowed);
 		if (retval < 0)
 			goto done;
+		if (!nodes_intersects(trialcs.mems_allowed,
+						node_states[N_MEMORY])) {
+			/*
+			 * error if only memoryless nodes specified.
+			 */
+			retval = -ENOSPC;
+			goto done;
+		}
 	}
-	nodes_and(trialcs.mems_allowed, trialcs.mems_allowed, node_online_map);
+	/*
+	 * Exclude memoryless nodes.  We know that trialcs.mems_allowed
+	 * contains at least one node with memory.
+	 */
+	nodes_and(trialcs.mems_allowed, trialcs.mems_allowed,
+						node_states[N_MEMORY]);
 	oldmem = cs->mems_allowed;
 	if (nodes_equal(oldmem, trialcs.mems_allowed)) {
 		retval = 0;		/* Too easy - nothing to do */
@@ -1366,8 +1379,9 @@ static void guarantee_online_cpus_mems_i
 
 /*
  * The cpus_allowed and mems_allowed nodemasks in the top_cpuset track
- * cpu_online_map and node_online_map.  Force the top cpuset to track
- * whats online after any CPU or memory node hotplug or unplug event.
+ * cpu_online_map and node_states[N_MEMORY].  Force the top cpuset to
+ * track what's online after any CPU or memory node hotplug or unplug
+ * event.
  *
  * To ensure that we don't remove a CPU or node from the top cpuset
  * that is currently in use by a child cpuset (which would violate
@@ -1387,7 +1401,7 @@ static void common_cpu_mem_hotplug_unplu
 
 	guarantee_online_cpus_mems_in_subtree(&top_cpuset);
 	top_cpuset.cpus_allowed = cpu_online_map;
-	top_cpuset.mems_allowed = node_online_map;
+	top_cpuset.mems_allowed = node_states[N_MEMORY];
 
 	mutex_unlock(&callback_mutex);
 	container_unlock();
@@ -1412,8 +1426,9 @@ static int cpuset_handle_cpuhp(struct no
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 /*
- * Keep top_cpuset.mems_allowed tracking node_online_map.
- * Call this routine anytime after you change node_online_map.
+ * Keep top_cpuset.mems_allowed tracking node_states[N_MEMORY].
+ * Call this routine anytime after you change
+ * node_states[N_MEMORY].
  * See also the previous routine cpuset_handle_cpuhp().
  */
 
@@ -1432,7 +1447,7 @@ void cpuset_track_online_nodes(void)
 void __init cpuset_init_smp(void)
 {
 	top_cpuset.cpus_allowed = cpu_online_map;
-	top_cpuset.mems_allowed = node_online_map;
+	top_cpuset.mems_allowed = node_states[N_MEMORY];
 
 	hotcpu_notifier(cpuset_handle_cpuhp, 0);
 }
@@ -1472,7 +1487,7 @@ void cpuset_init_current_mems_allowed(vo
  *
  * Description: Returns the nodemask_t mems_allowed of the cpuset
  * attached to the specified @tsk.  Guaranteed to return some non-empty
- * subset of node_online_map, even if this means going outside the
+ * subset of node_states[N_MEMORY], even if this means going outside the
  * tasks cpuset.
  **/
 
Index: Linux/include/linux/cpuset.h
===================================================================
--- Linux.orig/include/linux/cpuset.h	2007-07-24 11:24:56.000000000 -0400
+++ Linux/include/linux/cpuset.h	2007-07-24 12:20:56.000000000 -0400
@@ -92,7 +92,7 @@ static inline nodemask_t cpuset_mems_all
 	return node_possible_map;
 }
 
-#define cpuset_current_mems_allowed (node_online_map)
+#define cpuset_current_mems_allowed (node_states[N_MEMORY])
 static inline void cpuset_init_current_mems_allowed(void) {}
 static inline void cpuset_update_task_memory_state(void) {}
 #define cpuset_nodes_subset_current_mems_allowed(nodes) (1)



^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH/RFC] Memoryless nodes:  Suppress redundant "node with no memory" messages
  2007-07-11 19:06   ` [patch 01/12] NUMA: Generic management of nodemasks for various purposes Christoph Lameter
                       ` (3 preceding siblings ...)
  2007-07-24 20:30     ` [PATCH take3] " Lee Schermerhorn
@ 2007-07-24 20:35     ` Lee Schermerhorn
  2007-07-25 15:56       ` Nishanth Aravamudan
  4 siblings, 1 reply; 48+ messages in thread
From: Lee Schermerhorn @ 2007-07-24 20:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Christoph Lameter, Nishanth Aravamudan, akpm, kxr, KAMEZAWA Hiroyuki

Suppress redundant "node with no memory" messages

Against 2.6.22-rc6-mm1 atop Christoph Lameter's memoryless
node series.

get_pfn_range_for_nid() is called multiple times for each node
at boot time.  Each time, it will warn about nodes with no
memory, resulting in boot messages like:

	Node 0 active with no memory
	Node 0 active with no memory
	Node 0 active with no memory
	Node 0 active with no memory
	Node 0 active with no memory
	Node 0 active with no memory
	On node 0 totalpages: 0
	Node 0 active with no memory
	Node 0 active with no memory
	  DMA zone: 0 pages used for memmap
	Node 0 active with no memory
	Node 0 active with no memory
	  Normal zone: 0 pages used for memmap
	Node 0 active with no memory
	Node 0 active with no memory
	  Movable zone: 0 pages used for memmap

and so on for each memoryless node.  Track [in init data]
memoryless nodes that we've already reported to suppress
the redundant messages.
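
The "report once per node" idea can be sketched in userspace C.  This
is an illustration only: a 64-bit word stands in for the nodemask_t,
printf() stands in for printk(), and warn_memoryless_once() is a
hypothetical name that does not appear in the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the nodemask of nodes already reported. */
static uint64_t reported_memoryless;

/*
 * Print the warning the first time a memoryless node is seen and
 * stay silent on later calls for the same node.
 * Returns 1 if the message was printed, 0 if it was suppressed.
 */
static int warn_memoryless_once(unsigned int nid)
{
	uint64_t bit;

	assert(nid < 64);	/* the sketch tracks at most 64 nodes */
	bit = UINT64_C(1) << nid;

	if (reported_memoryless & bit)
		return 0;

	printf("Node %u active with no memory\n", nid);
	reported_memoryless |= bit;
	return 1;
}
```

Calling the function repeatedly for node 0 prints one line rather than
one line per call, which is the effect the patch aims for at boot.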

OR, we could eliminate the message altogether?  We do
report zero totalpages.  Sufficient?

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 mm/page_alloc.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Index: Linux/mm/page_alloc.c
===================================================================
--- Linux.orig/mm/page_alloc.c	2007-07-13 15:52:22.000000000 -0400
+++ Linux/mm/page_alloc.c	2007-07-24 12:37:35.000000000 -0400
@@ -3081,6 +3081,8 @@ static void __meminit account_node_bound
  * with no available memory, a warning is printed and the start and end
  * PFNs will be 0.
  */
+static nodemask_t __meminitdata memoryless_nodes;
+
 void __meminit get_pfn_range_for_nid(unsigned int nid,
 			unsigned long *start_pfn, unsigned long *end_pfn)
 {
@@ -3094,7 +3096,11 @@ void __meminit get_pfn_range_for_nid(uns
 	}
 
 	if (*start_pfn == -1UL) {
-		printk(KERN_WARNING "Node %u active with no memory\n", nid);
+		if (!node_isset(nid, memoryless_nodes)) {
+			printk(KERN_WARNING "Node %u active with no memory\n",
+						 nid);
+			node_set(nid, memoryless_nodes);
+		}
 		*start_pfn = 0;
 	}
 



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 00/12] NUMA: Memoryless node support V3
       [not found]         ` <1185372692.5604.22.camel@localhost>
@ 2007-07-25 15:45           ` Lee Schermerhorn
  2007-07-25 19:16             ` 2.6.23-rc1-mm1: boot hang on ia64 with memoryless nodes Lee Schermerhorn
  0 siblings, 1 reply; 48+ messages in thread
From: Lee Schermerhorn @ 2007-07-25 15:45 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: kxr, Andrew Morton, linux-mm, Bob Picco, Eric Whitney

On Wed, 2007-07-25 at 10:11 -0400, Lee Schermerhorn wrote:
> On Tue, 2007-07-24 at 14:04 -0700, Christoph Lameter wrote:
> > On Tue, 24 Jul 2007, Lee Schermerhorn wrote:
> <snip>
> > 
> > > I have tested your series, with Nish's and my patches in a
> > > memory-constrained config with all of the cpus on memoryless nodes and
> > > all of the memory in a cpu-less pseudo-node.  Seems to hold up fairly
> > > well under stress.  I did see a hang on Monday--test hung, very little
> > > free memory, pdflush just trickling out pages--but haven't been able to
> > > reproduce it.  Don't know what happened.
> > 
> > Hmm... Not good. Was that with a pre rc1 release? I got a hang here on a 
> > simulator that seems to be related to high res timers.
> 
> It was on 22-rc6-mm1.
> 
> >  
> > > I haven't had a chance to poke at it with memtoy to see how the
> > > interleave and hugepages work.  But, most folks don't use that, so I
> > > think it's appropriate for -mm.  
> > > 
> > > How should we proceed?  Shall I Ack the patches, mentioning the testing
> > > I've done and recommend inclusion in -mm?
> > 
> > Could you post the patchset with your acks or signoffs if you have made 
> > changes? Address them to Andrew, cc me and I will support merging of what 
> > you got. Note though that I think we are at the beginning of dealing 
> > with nodeless and per node memory use.
> 
> I'm rebasing to 23-rc1-mm1 right now.  Will do a quick test and repost.


!!! :-(  This is going to take longer than I thought.

1) ia64 build breakage due to ACPI_SLEEP -- have work around hack that
I'll send to Andrew as temp hot fix, but that's not the worst of it.

2) fails to boot with:

Unable to handle kernel paging request at virtual address a000400002000020
swapper[0]: Oops 11003706212352 [1]
Modules linked in:

Pid: 0, CPU 0, comm:              swapper
psr : 00001210084a2010 ifs : 8000000000000995 ip  : [<a0000001008a4df1>]    Not tainted
ip is at memmap_init_zone+0x271/0x2a0
unat: 0000000000000000 pfs : 0000000000000995 rsc : 0000000000000003
rnat: 0000000000000000 bsps: 0000000000000000 pr  : 656960155aa595a9
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a0000001008a4de0 b6  : a000000100340240 b7  : a0000001008ba3a0
f6  : 1003e0000000000000000 f7  : 1003e0000000000000974
f8  : 1003e000000000000e6e0 f9  : 1003e00000000001cdc02
f10 : 1003e0000000000000005 f11 : 1003e0000000000012785
r1  : a000000100c088a0 r2  : 0000000000000000 r3  : 0000000000002492
r8  : 0000000000000002 r9  : a000000100a08fe0 r10 : e000000100360700
r11 : 0000000000000000 r12 : a00000010092fd20 r13 : a000000100928000
r14 : fffffffffffffffb r15 : 0000000000000002 r16 : e000000101ca8000
r17 : 0000000000000003 r18 : a000400002000018 r19 : 0000000000000000
r20 : a000000100b5de00 r21 : 0000000000000008 r22 : e000000101ca8000
r23 : 0000000000000001 r24 : 5fffffffffe4924a r25 : 0006db6d7fe4924a
r26 : 5ff9249280000000 r27 : 0000000db6bf924a r28 : 0006db5fc9250000
r29 : 0000000e27ff8ec0 r30 : 00000000713ffc76 r31 : 000000001c3fff1e
WARNING: at mm/page_alloc.c:1562 __alloc_pages()

... then hard hang.

------------
I'll go ahead with the patch rebase while trying to debug this.

Lee


* Re: [PATCH take2] Memoryless nodes:  use "node_memory_map" for cpuset mems_allowed validation
  2007-07-24 19:01         ` Lee Schermerhorn
@ 2007-07-25 15:50           ` Nishanth Aravamudan
  0 siblings, 0 replies; 48+ messages in thread
From: Nishanth Aravamudan @ 2007-07-25 15:50 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Christoph Lameter, Paul Jackson, akpm, kxr, linux-mm, KAMEZAWA Hiroyuki

On 24.07.2007 [15:01:33 -0400], Lee Schermerhorn wrote:
> On Tue, 2007-07-24 at 09:19 -0700, Nishanth Aravamudan wrote:
> > On 24.07.2007 [10:15:25 -0400], Lee Schermerhorn wrote:
> > > Memoryless Nodes:  use "node_memory_map" for cpusets - take 2
> > > 
> > > Against 2.6.22-rc6-mm1 atop Christoph Lameter's memoryless nodes
> > > series
> > > 
> > > take 2:
> > > + replaced node_online_map in cpuset_current_mems_allowed()
> > >   with node_states[N_MEMORY]
> > > + replaced node_online_map in cpuset_init_smp() with
> > >   node_states[N_MEMORY]
> > > 
> > > cpusets try to ensure that any node added to a cpuset's 
> > > mems_allowed is on-line and contains memory.  The assumption
> > > was that online nodes contained memory.  Thus, it is possible
> > > to add memoryless nodes to a cpuset and then add tasks to this
> > > cpuset.  This results in continuous series of oom-kill and
> > > apparent system hang.
> > > 
> > > Change cpusets to use node_states[N_MEMORY] [a.k.a.
> > > node_memory_map] in place of node_online_map when vetting 
> > > memories.  Return error if admin attempts to write a non-empty
> > > mems_allowed node mask containing only memoryless-nodes.
> > 
> > I think you still are missing a few comment changes (anything mentioning
> > 'track'ing node_online_map will need to be changed, I think). Also, I
> > don't see the necessary change in common_cpu_mem_hotplug_unplug()
> > similar to cpuset_init_smp()'s change.
> 
> Sorry.  Multitasking meltdown...  Will fix.
> 
> Meanwhile:  
> 
> I've tested your 3 patches atop Christoph's series [on 22-rc6-mm1],
> with and without my cpuset patch and I can't reproduce the hang I saw
> a couple of days ago :-(.  I hate it when that happens!  Perhaps some
> system daemon started up during the test that hung.

Hrm, that stinks. I tested on several h/w variations before posting. And
the changes are pretty transparent, so I'm not sure where we'd hang (and
I would think if we were, we'd see it now too). But I'll do another
audit just to be sure.

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center


* Re: [PATCH take3] Memoryless nodes:  use "node_memory_map" for cpuset mems_allowed validation
  2007-07-24 20:30     ` [PATCH take3] " Lee Schermerhorn
@ 2007-07-25 15:53       ` Nishanth Aravamudan
  2007-07-25 22:00       ` Nishanth Aravamudan
  2007-07-27  0:40       ` Nishanth Aravamudan
  2 siblings, 0 replies; 48+ messages in thread
From: Nishanth Aravamudan @ 2007-07-25 15:53 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Christoph Lameter, Paul Jackson, akpm, kxr, linux-mm, KAMEZAWA Hiroyuki

On 24.07.2007 [16:30:19 -0400], Lee Schermerhorn wrote:
> Memoryless Nodes:  use "node_memory_map" for cpusets - take 3
> 
> Against 2.6.22-rc6-mm1 atop Christoph Lameter's memoryless nodes
> series
> 
> take 2:
> + replaced node_online_map in cpuset_current_mems_allowed()
>   with node_states[N_MEMORY]
> + replaced node_online_map in cpuset_init_smp() with
>   node_states[N_MEMORY]
> 
> take 3:
> + fix up comments and top level cpuset tracking of nodes
>   with memory [instead of on-line nodes].
> + maybe I got them all this time?

Looks like it! :)

> cpusets try to ensure that any node added to a cpuset's 
> mems_allowed is on-line and contains memory.  The assumption
> was that online nodes contained memory.  Thus, it is possible
> to add memoryless nodes to a cpuset and then add tasks to this
> cpuset.  This results in continuous series of oom-kill and
> apparent system hang.
> 
> Change cpusets to use node_states[N_MEMORY] [a.k.a.
> node_memory_map] in place of node_online_map when vetting 
> memories.  Return error if admin attempts to write a non-empty
> mems_allowed node mask containing only memoryless-nodes.
> 
> Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

Tested on 4-node ppc64 with 2 memoryless nodes. Top cpuset (and all
subsequent ones) only allow nodes 0 and 1 (the nodes with memory).

Tested-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
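
For readers following the thread, the vetting rule in the patch
description above can be sketched in plain userspace C.  This is only an
illustration under stated assumptions, not the patch itself:
nodemask_sketch_t, node_bit() and validate_mems_allowed() are
hypothetical names, a 64-bit integer stands in for the kernel's
nodemask_t, trimming a mixed request down to the nodes with memory is
inferred from the description, and -EINVAL is an assumed error code (the
thread does not say which errno the patch returns).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MAX_NUMNODES 64

/* Hypothetical stand-in for the kernel's nodemask_t: one bit per node. */
typedef uint64_t nodemask_sketch_t;

/* Build a mask with a single node set. */
static nodemask_sketch_t node_bit(unsigned int node)
{
	assert(node < MAX_NUMNODES);
	return (nodemask_sketch_t)1 << node;
}

/*
 * Vet a requested mems_allowed mask against the nodes that have memory
 * (the role node_states[N_MEMORY] plays in the patch description).
 *
 * - An empty request is passed through unchanged.
 * - A non-empty request naming only memoryless nodes is rejected.
 * - Otherwise the request is trimmed to the nodes that have memory
 *   (assumed behaviour, see the lead-in).
 */
static int validate_mems_allowed(nodemask_sketch_t requested,
				 nodemask_sketch_t nodes_with_memory,
				 nodemask_sketch_t *result)
{
	nodemask_sketch_t usable = requested & nodes_with_memory;

	if (requested != 0 && usable == 0)
		return -EINVAL;	/* assumed errno, not taken from the patch */

	*result = usable;
	return 0;
}
```

With only node 4 holding memory, a request for nodes 0-1 is rejected,
while a request for nodes 0 and 4 is trimmed to node 4.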


* Re: [PATCH/RFC] Memoryless nodes:  Suppress redundant "node with no memory" messages
  2007-07-24 20:35     ` [PATCH/RFC] Memoryless nodes: Suppress redundant "node with no memory" messages Lee Schermerhorn
@ 2007-07-25 15:56       ` Nishanth Aravamudan
  0 siblings, 0 replies; 48+ messages in thread
From: Nishanth Aravamudan @ 2007-07-25 15:56 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, Christoph Lameter, akpm, kxr, KAMEZAWA Hiroyuki

On 24.07.2007 [16:35:13 -0400], Lee Schermerhorn wrote:
> Suppress redundant "node with no memory" messages
> 
> Against 2.6.22-rc6-mm1 atop Christoph Lameter's memoryless
> node series.
> 
> get_pfn_range_for_nid() is called multiple times for each node
> at boot time.  Each time, it will warn about nodes with no
> memory, resulting in boot messages like:
> 
> 	Node 0 active with no memory
> 	Node 0 active with no memory
> 	Node 0 active with no memory
> 	Node 0 active with no memory
> 	Node 0 active with no memory
> 	Node 0 active with no memory
> 	On node 0 totalpages: 0
> 	Node 0 active with no memory
> 	Node 0 active with no memory
> 	  DMA zone: 0 pages used for memmap
> 	Node 0 active with no memory
> 	Node 0 active with no memory
> 	  Normal zone: 0 pages used for memmap
> 	Node 0 active with no memory
> 	Node 0 active with no memory
> 	  Movable zone: 0 pages used for memmap
> 
> and so on for each memoryless node.  Track [in init data]
> memoryless nodes that we've already reported to suppress
> the redundant messages.
> 
> OR, we could eliminate the message altogether?  We do
> report zero totalpages.  Sufficient?

Not being an expert, I honestly don't know. But I do think it's quite
clear that we only need one or the other type of message (presuming both
are always shown, that is neither can somehow already be disabled), as
they say the same thing :) I found this to be odd behavior too.

> Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
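
The "track nodes we've already reported" idea from the patch description
above can be sketched as a warn-once helper.  This is a userspace
illustration only: warn_memoryless_node_once() and
reported_memoryless_nodes are hypothetical names, a 64-bit integer stands
in for whatever init-data bitmap the real patch uses, and printf() stands
in for printk().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_NUMNODES 64

/* Hypothetical record of nodes already reported; one bit per node. */
static uint64_t reported_memoryless_nodes;

/*
 * Print the "no memory" message for a node at most once.
 * Returns true if the message was printed by this call.
 */
static bool warn_memoryless_node_once(unsigned int nid)
{
	uint64_t bit;

	assert(nid < MAX_NUMNODES);
	bit = (uint64_t)1 << nid;

	if (reported_memoryless_nodes & bit)
		return false;	/* already reported, stay quiet */

	reported_memoryless_nodes |= bit;
	printf("Node %u active with no memory\n", nid);
	return true;
}
```

However many times the caller runs for a given node, only the first call
prints; later calls for the same node return false.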


* 2.6.23-rc1-mm1:  boot hang on ia64 with memoryless nodes
  2007-07-25 15:45           ` Lee Schermerhorn
@ 2007-07-25 19:16             ` Lee Schermerhorn
  2007-07-25 19:38               ` Christoph Lameter
  0 siblings, 1 reply; 48+ messages in thread
From: Lee Schermerhorn @ 2007-07-25 19:16 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: kxr, Andrew Morton, linux-mm, Bob Picco, Mel Gorman, Eric Whitney

On Wed, 2007-07-25 at 11:45 -0400, Lee Schermerhorn wrote:
<snip>

> 2) fails to boot with:
> 
> Unable to handle kernel paging request at virtual address a000400002000020
> swapper[0]: Oops 11003706212352 [1]
> Modules linked in:
> 
> Pid: 0, CPU 0, comm:              swapper
> psr : 00001210084a2010 ifs : 8000000000000995 ip  : [<a0000001008a4df1>]    Not tainted
> ip is at memmap_init_zone+0x271/0x2a0
<snip>
> 
> ... then hard hang.

I tried Mel Gorman's ia64 memmap corruption patches [the ones that aren't
already in 23-rc1-mm1--looks like the original one is?], and see the
same thing.

I tried to deselect SPARSEMEM_VMEMMAP.  Kconfig's "def_bool=y" wouldn't
let me :-(.  After hacking the Kconfig and mm/sparse.c to allow that,
boot hangs with no error messages shortly after "Built N zonelists..."
message.

Backed off to DISCONTIGMEM+VIRTUAL_MEMORY_MAP, and saw same hang as with
(SPARSEMEM && !SPARSEMEM_VMEMMAP).

I should mention that I have my test system in the "fully interleaved"
configuration for testing the memoryless node patches.  This means that
nodes 0-3 [the real nodes with the cpus attached] have no memory.  All
memory resides in a cpu-less pseudo-node.  I'm wondering if
SPARSEMEM_VMEMMAP can handle this?  22-rc6-mm1 booted OK on this config
w/ SPARSEMEM_EXTREME.

Lee


* Re: 2.6.23-rc1-mm1:  boot hang on ia64 with memoryless nodes
  2007-07-25 19:16             ` 2.6.23-rc1-mm1: boot hang on ia64 with memoryless nodes Lee Schermerhorn
@ 2007-07-25 19:38               ` Christoph Lameter
  2007-07-25 20:03                 ` Christoph Lameter
  2007-07-25 21:18                 ` Lee Schermerhorn
  0 siblings, 2 replies; 48+ messages in thread
From: Christoph Lameter @ 2007-07-25 19:38 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: kxr, Andrew Morton, linux-mm, Bob Picco, Mel Gorman,
	Eric Whitney, Andy Whitcroft

(ccing Andy who did the work on the config stuff)

On Wed, 25 Jul 2007, Lee Schermerhorn wrote:

> I tried to deselect SPARSEMEM_VMEMMAP.  Kconfig's "def_bool=y" wouldn't
> let me :-(.  After hacking the Kconfig and mm/sparse.c to allow that,
> boot hangs with no error messages shortly after "Built N zonelists..."
> message.

I get a similar hang here and see the system looping in softirq / hrtimer 
code.

> Backed off to DISCONTIGMEM+VIRTUAL_MEMORY_MAP, and saw same hang as with
> (SPARSEMEM && !SPARSEMEM_VMEMMAP).

So it's not related to SPARSE VMEMMAP? General VMEMMAP issue on IA64?
 
> I should mention that I have my test system in the "fully interleaved"
> configuration for testing the memoryless node patches.  This means that
> nodes 0-3 [the real nodes with the cpus attached] have no memory.  All
> memory resides in a cpu-less pseudo-node.  I'm wondering if
> SPARSEMEM_VMEMMAP can handle this?  22-rc6-mm1 booted OK on this config
> w/ SPARSEMEM_EXTREME.

The vmemmap page table blocks get allocated on the nodes where there 
is actual memory but sparse.c may not have been updated to only look for 
memory on nodes that have memory. If it looks for online nodes then we 
may have an issue there. Andy?

Were you able to run discontig/vmemmap in the past with this 
configuration?


* Re: 2.6.23-rc1-mm1:  boot hang on ia64 with memoryless nodes
  2007-07-25 19:38               ` Christoph Lameter
@ 2007-07-25 20:03                 ` Christoph Lameter
  2007-07-25 21:18                 ` Lee Schermerhorn
  1 sibling, 0 replies; 48+ messages in thread
From: Christoph Lameter @ 2007-07-25 20:03 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: kxr, Andrew Morton, linux-mm, Bob Picco, Mel Gorman,
	Eric Whitney, Andy Whitcroft

On Wed, 25 Jul 2007, Christoph Lameter wrote:

> I get a similar hang here and see the system looping in softirq / hrtimer 
> code.

Keith Rich also has a hang with current git. My hang was with 2.6.23-rc1.
Keith Owens has significant issues with 2.6.23-rc1. Let's get this onto 
linux-ia64. I do not think we can do any testing with 2.6.23-rc1-mm1 until 
we have ironed out the issues in the base kernel.



* Re: 2.6.23-rc1-mm1:  boot hang on ia64 with memoryless nodes
  2007-07-25 19:38               ` Christoph Lameter
  2007-07-25 20:03                 ` Christoph Lameter
@ 2007-07-25 21:18                 ` Lee Schermerhorn
  2007-07-26 13:53                   ` Lee Schermerhorn
  1 sibling, 1 reply; 48+ messages in thread
From: Lee Schermerhorn @ 2007-07-25 21:18 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: kxr, Andrew Morton, linux-mm, Bob Picco, Mel Gorman,
	Eric Whitney, Andy Whitcroft

On Wed, 2007-07-25 at 12:38 -0700, Christoph Lameter wrote:
> (ccing Andy who did the work on the config stuff)
> 
> On Wed, 25 Jul 2007, Lee Schermerhorn wrote:
> 
> > I tried to deselect SPARSEMEM_VMEMMAP.  Kconfig's "def_bool=y" wouldn't
> > let me :-(.  After hacking the Kconfig and mm/sparse.c to allow that,
> > boot hangs with no error messages shortly after "Built N zonelists..."
> > message.
> 
> I get a similar hang here and see the system looping in softirq / hrtimer 
> code.
> 
> > Backed off to DISCONTIGMEM+VIRTUAL_MEMORY_MAP, and saw same hang as with
> > (SPARSEMEM && !SPARSEMEM_VMEMMAP).
> 
> So it's not related to SPARSE VMEMMAP? General VMEMMAP issue on IA64?

This hang is different from the one I see with SPARSE VMEMMAP -- no
"Unable to handle kernel paging request..." message.  Just hangs after
"Built N zonelists..."  and some message about "color" that I didn't
capture.  Next time [:-(]...

>  
> > I should mention that I have my test system in the "fully interleaved"
> > configuration for testing the memoryless node patches.  This means that
> > nodes 0-3 [the real nodes with the cpus attached] have no memory.  All
> > memory resides in a cpu-less pseudo-node.  I'm wondering if
> > SPARSEMEM_VMEMMAP can handle this?  22-rc6-mm1 booted OK on this config
> > w/ SPARSEMEM_EXTREME.
> 
> The vmemmap page table blocks get allocated on the nodes where there 
> is actual memory but sparse.c may not have been updated to only look for 
> memory on nodes that have memory. If it looks for online nodes then we 
> may have an issue there. Andy?

In free_area_init_nodes(), free_area_init_node() [singular] is called
for_each_online_node...   I'm looking into this.  I might need an
additional memoryless node patch to test the memoryless node patches...

> 
> Were you able to run discontig/vmemmap in the past with this 
> configuration?

Yeah, way back ~2.6.14/15 or so.  My configs have all used SPARSEMEM
since then.

I'm going to switch back to "100% cell local memory" and try again.
But, if you're seeing hangs w/o memoryless nodes, I'm not hopeful.

Later,
Lee
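
The for_each_online_node observation above comes down to which node set
an init loop walks.  A minimal userspace sketch, assuming a layout like
the "fully interleaved" configuration described earlier in the thread:
nodes 0-3 online without memory and one pseudo-node holding all memory.
Numbering the pseudo-node as node 4 is an assumption, and both helper
names are hypothetical.  It only counts visits; it does not claim to
show where the boot failure actually is.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NUMNODES 64

/*
 * Count how many nodes a loop over "visit_mask" would touch.  The mask
 * plays the role of the node set being iterated (online nodes, or only
 * the nodes with memory).
 */
static unsigned int count_visited_nodes(uint64_t visit_mask)
{
	unsigned int nid, visited = 0;

	for (nid = 0; nid < MAX_NUMNODES; nid++) {
		if (visit_mask & ((uint64_t)1 << nid))
			visited++;
	}
	return visited;
}

/* Count the memoryless nodes that a loop over online nodes visits. */
static unsigned int count_memoryless_visits(uint64_t online,
					    uint64_t has_memory)
{
	/* A node with memory is expected to be online as well. */
	assert((has_memory & ~online) == 0);
	return count_visited_nodes(online & ~has_memory);
}
```

With online = 0x1f and has_memory = 0x10, a loop over online nodes
visits five nodes, four of them memoryless, while a loop restricted to
nodes with memory visits only one.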


* Re: [PATCH take3] Memoryless nodes:  use "node_memory_map" for cpuset mems_allowed validation
  2007-07-24 20:30     ` [PATCH take3] " Lee Schermerhorn
  2007-07-25 15:53       ` Nishanth Aravamudan
@ 2007-07-25 22:00       ` Nishanth Aravamudan
  2007-07-26 13:04         ` Lee Schermerhorn
  2007-07-27  0:40       ` Nishanth Aravamudan
  2 siblings, 1 reply; 48+ messages in thread
From: Nishanth Aravamudan @ 2007-07-25 22:00 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Christoph Lameter, Paul Jackson, akpm, kxr, linux-mm, KAMEZAWA Hiroyuki

On 24.07.2007 [16:30:19 -0400], Lee Schermerhorn wrote:
> Memoryless Nodes:  use "node_memory_map" for cpusets - take 3
> 
> Against 2.6.22-rc6-mm1 atop Christoph Lameter's memoryless nodes
> series
> 
> take 2:
> + replaced node_online_map in cpuset_current_mems_allowed()
>   with node_states[N_MEMORY]
> + replaced node_online_map in cpuset_init_smp() with
>   node_states[N_MEMORY]
> 
> take 3:
> + fix up comments and top level cpuset tracking of nodes
>   with memory [instead of on-line nodes].
> + maybe I got them all this time?

My ack stands, but I believe Documentation/cpusets.txt will need
updating too :)

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center


* Re: [PATCH take3] Memoryless nodes:  use "node_memory_map" for cpuset mems_allowed validation
  2007-07-25 22:00       ` Nishanth Aravamudan
@ 2007-07-26 13:04         ` Lee Schermerhorn
  0 siblings, 0 replies; 48+ messages in thread
From: Lee Schermerhorn @ 2007-07-26 13:04 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Christoph Lameter, Paul Jackson, akpm, kxr, linux-mm, KAMEZAWA Hiroyuki

On Wed, 2007-07-25 at 15:00 -0700, Nishanth Aravamudan wrote:
> On 24.07.2007 [16:30:19 -0400], Lee Schermerhorn wrote:
> > Memoryless Nodes:  use "node_memory_map" for cpusets - take 3
> > 
> > Against 2.6.22-rc6-mm1 atop Christoph Lameter's memoryless nodes
> > series
> > 
> > take 2:
> > + replaced node_online_map in cpuset_current_mems_allowed()
> >   with node_states[N_MEMORY]
> > + replaced node_online_map in cpuset_init_smp() with
> >   node_states[N_MEMORY]
> > 
> > take 3:
> > + fix up comments and top level cpuset tracking of nodes
> >   with memory [instead of on-line nodes].
> > + maybe I got them all this time?
> 
> My ack stands, but I believe Documentation/cpusets.txt will need
> updating too :)

When [he says hopefully] I get the memoryless patches tested on
23-rc1-mm1, I'll include a documentation update patch.

Lee


* Re: 2.6.23-rc1-mm1:  boot hang on ia64 with memoryless nodes
  2007-07-25 21:18                 ` Lee Schermerhorn
@ 2007-07-26 13:53                   ` Lee Schermerhorn
  2007-07-26 14:00                     ` KAMEZAWA Hiroyuki
  2007-07-26 14:33                     ` Lee Schermerhorn
  0 siblings, 2 replies; 48+ messages in thread
From: Lee Schermerhorn @ 2007-07-26 13:53 UTC (permalink / raw)
  To: Christoph Lameter, linux-ia64
  Cc: kxr, Andrew Morton, linux-mm, Bob Picco, Mel Gorman,
	Eric Whitney, Andy Whitcroft

On Wed, 2007-07-25 at 17:18 -0400, Lee Schermerhorn wrote: 
> On Wed, 2007-07-25 at 12:38 -0700, Christoph Lameter wrote:
> > (ccing Andy who did the work on the config stuff)
> > 
> > On Wed, 25 Jul 2007, Lee Schermerhorn wrote:
> > 
> > > I tried to deselect SPARSEMEM_VMEMMAP.  Kconfig's "def_bool=y" wouldn't
> > > let me :-(.  After hacking the Kconfig and mm/sparse.c to allow that,
> > > boot hangs with no error messages shortly after "Built N zonelists..."
> > > message.
> > 
> > I get a similar hang here and see the system looping in softirq / hrtimer 
> > code.
> > 
> > > Backed off to DISCONTIGMEM+VIRTUAL_MEMORY_MAP, and saw same hang as with
> > > (SPARSEMEM && !SPARSEMEM_VMEMMAP).
> > 
> > So it's not related to SPARSE VMEMMAP? General VMEMMAP issue on IA64?
> 
> This hang is different from the one I see with SPARSE VMEMMAP -- no
> "Unable to handle kernel paging request..." message.  Just hangs after
> "Built N zonelists..."  and some message about "color" that I didn't
> capture.  Next time [:-(]...

The "color" message was actually:

Console:  colour dummy device 80x25

So, now I'm wondering if I'm hitting the "Regression in serial
console..." issue, and the system was actually booting--I just didn't
see any output.  If so, the "Unable to handle kernel paging request..."
hang might well be a problem with SPARSEMEM_VMEMMAP...

Lee


* Re: 2.6.23-rc1-mm1:  boot hang on ia64 with memoryless nodes
  2007-07-26 13:53                   ` Lee Schermerhorn
@ 2007-07-26 14:00                     ` KAMEZAWA Hiroyuki
  2007-07-26 18:10                       ` Lee Schermerhorn
  2007-07-26 14:33                     ` Lee Schermerhorn
  1 sibling, 1 reply; 48+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-07-26 14:00 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: clameter, linux-ia64, kxr, akpm, linux-mm, bob.picco, mel,
	eric.whitney, apw

On Thu, 26 Jul 2007 09:53:27 -0400
Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:

> On Wed, 2007-07-25 at 17:18 -0400, Lee Schermerhorn wrote: 
> > On Wed, 2007-07-25 at 12:38 -0700, Christoph Lameter wrote:
> > > (ccing Andy who did the work on the config stuff)
> > > 
> > > On Wed, 25 Jul 2007, Lee Schermerhorn wrote:
> > > 
> > > > I tried to deselect SPARSEMEM_VMEMMAP.  Kconfig's "def_bool=y" wouldn't
> > > > let me :-(.  After hacking the Kconfig and mm/sparse.c to allow that,
> > > > boot hangs with no error messages shortly after "Built N zonelists..."
> > > > message.
> > > 
> > > I get a similar hang here and see the system looping in softirq / hrtimer 
> > > code.
> > > 
> > > > Backed off to DISCONTIGMEM+VIRTUAL_MEMORY_MAP, and saw same hang as with
> > > > (SPARSEMEM && !SPARSEMEM_VMEMMAP).
> > > 
> > > So it's not related to SPARSE VMEMMAP? General VMEMMAP issue on IA64?
> > 
> > This hang is different from the one I see with SPARSE VMEMMAP -- no
> > "Unable to handle kernel paging request..." message.  Just hangs after
> > "Built N zonelists..."  and some message about "color" that I didn't
> > capture.  Next time [:-(]...
> 
> The "color" message was actually:
> 
> Console:  colour dummy device 80x25
> 
> So, now I'm wondering if I'm hitting the "Regression in serial
> console..." issue, and the system was actually booting--I just didn't
> see any output.  If so, the "Unable to handle kernel paging request..."
> hang might well be a problem with SPARSEMEM_VMEMMAP...
> 
About SPARSEMEM_VMEMMAP try this:
http://lkml.org/lkml/2007/7/26/161

Thanks,
-Kame


* Re: 2.6.23-rc1-mm1:  boot hang on ia64 with memoryless nodes
  2007-07-26 13:53                   ` Lee Schermerhorn
  2007-07-26 14:00                     ` KAMEZAWA Hiroyuki
@ 2007-07-26 14:33                     ` Lee Schermerhorn
  1 sibling, 0 replies; 48+ messages in thread
From: Lee Schermerhorn @ 2007-07-26 14:33 UTC (permalink / raw)
  To: Christoph Lameter, linux-ia64
  Cc: kxr, Andrew Morton, linux-mm, Bob Picco, Mel Gorman,
	Eric Whitney, Andy Whitcroft

On Thu, 2007-07-26 at 09:53 -0400, Lee Schermerhorn wrote:
> On Wed, 2007-07-25 at 17:18 -0400, Lee Schermerhorn wrote: 
> > On Wed, 2007-07-25 at 12:38 -0700, Christoph Lameter wrote:
> > > (ccing Andy who did the work on the config stuff)
> > > 
> > > On Wed, 25 Jul 2007, Lee Schermerhorn wrote:
> > > 
> > > > I tried to deselect SPARSEMEM_VMEMMAP.  Kconfig's "def_bool=y" wouldn't
> > > > let me :-(.  After hacking the Kconfig and mm/sparse.c to allow that,
> > > > boot hangs with no error messages shortly after "Built N zonelists..."
> > > > message.
> > > 
> > > I get a similar hang here and see the system looping in softirq / hrtimer 
> > > code.
> > > 
> > > > Backed off to DISCONTIGMEM+VIRTUAL_MEMORY_MAP, and saw same hang as with
> > > > (SPARSEMEM && !SPARSEMEM_VMEMMAP).
> > > 
> > > So it's not related to SPARSE VMEMMAP? General VMEMMAP issue on IA64?
> > 
> > This hang is different from the one I see with SPARSE VMEMMAP -- no
> > "Unable to handle kernel paging request..." message.  Just hangs after
> > "Built N zonelists..."  and some message about "color" that I didn't
> > capture.  Next time [:-(]...
> 
> The "color" message was actually:
> 
> Console:  colour dummy device 80x25
> 
> So, now I'm wondering if I'm hitting the "Regression in serial
> console..." issue, and the system was actually booting--I just didn't
> see any output.  If so, the "Unable to handle kernel paging request..."
> hang might well be a problem with SPARSEMEM_VMEMMAP...
> 

After applying the hotfixes from Andrew's repository and the patches
from the mailing lists listed below, I'm booting 23-rc1-mm1 with a zx1
specific config [on an sx1000].  Gotta run for a meeting, but I'll try
generic kernel this pm.  Then back to testing memoryless node
patches, ...

Other "hot fixes":

2 of Mel Gorman's patches to fix ia64 memmap corruption [mm list]
Yasuaki Ishimatsu's assign irq vector fix [ia64 list]
Kenji Kaneshige's "wrong access to vector" patch [ia64 list]
Kame-san's sparsemem-vmemmap fix [lkml]

Lee



* Re: 2.6.23-rc1-mm1:  boot hang on ia64 with memoryless nodes
  2007-07-26 14:00                     ` KAMEZAWA Hiroyuki
@ 2007-07-26 18:10                       ` Lee Schermerhorn
  0 siblings, 0 replies; 48+ messages in thread
From: Lee Schermerhorn @ 2007-07-26 18:10 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: clameter, linux-ia64, kxr, akpm, linux-mm, bob.picco, mel,
	eric.whitney, apw

On Thu, 2007-07-26 at 23:00 +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 26 Jul 2007 09:53:27 -0400
> Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> 
> > On Wed, 2007-07-25 at 17:18 -0400, Lee Schermerhorn wrote: 
> > > On Wed, 2007-07-25 at 12:38 -0700, Christoph Lameter wrote:
> > > > (ccing Andy who did the work on the config stuff)
> > > > 
> > > > On Wed, 25 Jul 2007, Lee Schermerhorn wrote:
> > > > 
> > > > > I tried to deselect SPARSEMEM_VMEMMAP.  Kconfig's "def_bool=y" wouldn't
> > > > > let me :-(.  After hacking the Kconfig and mm/sparse.c to allow that,
> > > > > boot hangs with no error messages shortly after "Built N zonelists..."
> > > > > message.
> > > > 
> > > > I get a similar hang here and see the system looping in softirq / hrtimer 
> > > > code.
> > > > 
> > > > > Backed off to DISCONTIGMEM+VIRTUAL_MEMORY_MAP, and saw same hang as with
> > > > > (SPARSEMEM && !SPARSEMEM_VMEMMAP).
> > > > 
> > > > So it's not related to SPARSE VMEMMAP? General VMEMMAP issue on IA64?
> > > 
> > > This hang is different from the one I see with SPARSE VMEMMAP -- no
> > > "Unable to handle kernel paging request..." message.  Just hangs after
> > > "Built N zonelists..."  and some message about "color" that I didn't
> > > capture.  Next time [:-(]...
> > 
> > The "color" message was actually:
> > 
> > Console:  colour dummy device 80x25
> > 
> > So, now I'm wondering if I'm hitting the "Regression in serial
> > console..." issue, and the system was actually booting--I just didn't
> > see any output.  If so, the "Unable to handle kernel paging request..."
> > hang might well be a problem with SPARSEMEM_VMEMMAP...
> > 
> About SPARSEMEM_VMEMMAP try this:
> http://lkml.org/lkml/2007/7/26/161
> 

Kame-san:

Thank you.  This solved my problem.  I can now boot both zx1 and, with
other patches from the mailing lists, generic kernels on my ia64
platform.

Lee


* Re: [PATCH take3] Memoryless nodes:  use "node_memory_map" for cpuset mems_allowed validation
  2007-07-24 20:30     ` [PATCH take3] " Lee Schermerhorn
  2007-07-25 15:53       ` Nishanth Aravamudan
  2007-07-25 22:00       ` Nishanth Aravamudan
@ 2007-07-27  0:40       ` Nishanth Aravamudan
  2007-07-27 14:15         ` Lee Schermerhorn
  2 siblings, 1 reply; 48+ messages in thread
From: Nishanth Aravamudan @ 2007-07-27  0:40 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Christoph Lameter, Paul Jackson, akpm, kxr, linux-mm, KAMEZAWA Hiroyuki

On 24.07.2007 [16:30:19 -0400], Lee Schermerhorn wrote:
> Memoryless Nodes:  use "node_memory_map" for cpusets - take 3
> 
> Against 2.6.22-rc6-mm1 atop Christoph Lameter's memoryless nodes
> series
> 
> take 2:
> + replaced node_online_map in cpuset_current_mems_allowed()
>   with node_states[N_MEMORY]
> + replaced node_online_map in cpuset_init_smp() with
>   node_states[N_MEMORY]
> 
> take 3:
> + fix up comments and top level cpuset tracking of nodes
>   with memory [instead of on-line nodes].
> + maybe I got them all this time?
> 
> cpusets try to ensure that any node added to a cpuset's 
> mems_allowed is on-line and contains memory.  The assumption
> was that online nodes contained memory.  Thus, it is possible
> to add memoryless nodes to a cpuset and then add tasks to this
> cpuset.  This results in continuous series of oom-kill and
> apparent system hang.
> 
> Change cpusets to use node_states[N_MEMORY] [a.k.a.
> node_memory_map] in place of node_online_map when vetting 
> memories.  Return error if admin attempts to write a non-empty
> mems_allowed node mask containing only memoryless-nodes.
> 
> Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
> 
>  include/linux/cpuset.h |    2 -
>  kernel/cpuset.c        |   51 +++++++++++++++++++++++++++++++------------------
>  2 files changed, 34 insertions(+), 19 deletions(-)

Small fix for a typo that breaks the build with !CPUSETS.

---
FYI: I noticed that oldconfig on 2.6.23-rc1-mm1 with CPUSETS=y disables
CPUSETS because of the introduction of CONFIG_CONTAINERS :(

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index f8f4f68..d01b1bc 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -92,7 +92,7 @@ static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
 	return node_possible_map;
 }
 
-#define cpuset_current_mems_allowed (node_states[N_MEMORY))
+#define cpuset_current_mems_allowed (node_states[N_MEMORY])
 static inline void cpuset_init_current_mems_allowed(void) {}
 static inline void cpuset_update_task_memory_state(void) {}
 #define cpuset_nodes_subset_current_mems_allowed(nodes) (1)

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
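
A minimal sketch of why a typo like this can survive testing: a macro body is not parsed until the macro is expanded, and this definition appears to sit in the stub half of cpuset.h that is only compiled when CONFIG_CPUSETS is off. The types below are hypothetical miniatures for illustration, not the kernel's real nodemask_t or node_states definitions, and current_mems_allows() is an invented helper.

```c
#include <assert.h>

/* Hypothetical miniature of the kernel types; not the real definitions. */
enum node_state_model { N_POSSIBLE, N_ONLINE, N_MEMORY, NR_NODE_STATES };
typedef struct { unsigned long bits; } nodemask_t;

static nodemask_t node_states[NR_NODE_STATES];

#ifndef CONFIG_CPUSETS
/*
 * Stub used when cpusets are compiled out.  With the original
 * "(node_states[N_MEMORY))" body, any expansion of this macro is a
 * syntax error, but only in builds that take this branch.
 */
#define cpuset_current_mems_allowed (node_states[N_MEMORY])
#endif

/* Invented helper: returns 1 if the current task may allocate on node. */
static int current_mems_allows(int node)
{
	assert(node >= 0 && node < (int)(8 * sizeof(unsigned long)));
	return (int)((cpuset_current_mems_allowed.bits >> node) & 1UL);
}
```

Building this sketch with -DCONFIG_CPUSETS fails because the stub is absent, which mirrors how the real header provides a different definition in that configuration.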


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH take3] Memoryless nodes:  use "node_memory_map" for cpuset mems_allowed validation
  2007-07-27  0:40       ` Nishanth Aravamudan
@ 2007-07-27 14:15         ` Lee Schermerhorn
  0 siblings, 0 replies; 48+ messages in thread
From: Lee Schermerhorn @ 2007-07-27 14:15 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Christoph Lameter, Paul Jackson, akpm, kxr, linux-mm,
	KAMEZAWA Hiroyuki, Bob Picco

On Thu, 2007-07-26 at 17:40 -0700, Nishanth Aravamudan wrote:
> On 24.07.2007 [16:30:19 -0400], Lee Schermerhorn wrote:
> > Memoryless Nodes:  use "node_memory_map" for cpusets - take 3
> > 
> > Against 2.6.22-rc6-mm1 atop Christoph Lameter's memoryless nodes
> > series
> > 
> > take 2:
> > + replaced node_online_map in cpuset_current_mems_allowed()
> >   with node_states[N_MEMORY]
> > + replaced node_online_map in cpuset_init_smp() with
> >   node_states[N_MEMORY]
> > 
> > take 3:
> > + fix up comments and top level cpuset tracking of nodes
> >   with memory [instead of on-line nodes].
> > + maybe I got them all this time?
> > 
> > cpusets try to ensure that any node added to a cpuset's 
> > mems_allowed is on-line and contains memory.  The assumption
> > was that online nodes contained memory.  Thus, it is possible
> > to add memoryless nodes to a cpuset and then add tasks to this
> > cpuset.  This results in continuous series of oom-kill and
> > apparent system hang.
> > 
> > Change cpusets to use node_states[N_MEMORY] [a.k.a.
> > node_memory_map] in place of node_online_map when vetting 
> > memories.  Return error if admin attempts to write a non-empty
> > mems_allowed node mask containing only memoryless-nodes.
> > 
> > Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
> > 
> >  include/linux/cpuset.h |    2 -
> >  kernel/cpuset.c        |   51 +++++++++++++++++++++++++++++++------------------
> >  2 files changed, 34 insertions(+), 19 deletions(-)
> 
> Small fix for a typo that breaks the build with !CPUSETS.
> 
> ---
> FYI: I noticed that oldconfig on 2.6.23-rc1-mm1 with CPUSETS=y disables
> CPUSETS because of the introduction of CONFIG_CONTAINERS :(
> 
> Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
> 
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index f8f4f68..d01b1bc 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -92,7 +92,7 @@ static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
>  	return node_possible_map;
>  }
>  
> -#define cpuset_current_mems_allowed (node_states[N_MEMORY))
> +#define cpuset_current_mems_allowed (node_states[N_MEMORY])
>  static inline void cpuset_init_current_mems_allowed(void) {}
>  static inline void cpuset_update_task_memory_state(void) {}
>  #define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
> 

Thanks, Nish.  Bob Picco pointed that out to me and I've fixed it in
take4.  Bob is reviewing the patches and should get back to me today if
he has any issues.  I've tested the current patches against 23-rc1-mm1
overnight on the following config and it held up fine.

Configuration:  100% cell local memory, boot with mem=16g [out of 32G
available].  Gave me one memoryless node, and one very small node:

available: 5 nodes (0-4)
node 0 size: 7600 MB
node 0 free: 6647 MB
node 1 size: 8127 MB
node 1 free: 7675 MB
node 2 size: 144 MB
node 2 free: 94 MB
node 3 size: 0 MB
node 3 free: 0 MB
node 4 size: 511 MB
node 4 free: 494 MB

Ran a test exerciser [custom workload in Dave Anderson's "usex" program]
in a cpuset containing cpus and memory from nodes 1-3.  Not a lot of
mempolicy testing, but fairly stressful otherwise.

Lee
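
A minimal sketch of the check described in the patch, applied to the configuration above, where node 3 has no memory. It assumes a plain bitmask in place of the kernel's nodemask_t, assumes the accepted mask is filtered down to nodes with memory, and assumes -EINVAL as the error code; filter_mems_allowed() and NODE_BIT() are hypothetical names, not the functions in kernel/cpuset.c.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for nodemask_t: one bit per node. */
typedef unsigned long nodemask_bits;

#define NODE_BIT(n) (1UL << (n))

/*
 * Restrict a requested mems_allowed mask to nodes that have memory.
 * Returns 0 and stores the filtered mask in *out.  Returns -EINVAL
 * (an assumed error code) and leaves *out untouched when the request
 * is non-empty but names only memoryless nodes.
 */
static int filter_mems_allowed(nodemask_bits requested,
			       nodemask_bits nodes_with_memory,
			       nodemask_bits *out)
{
	nodemask_bits usable = requested & nodes_with_memory;

	assert(out != NULL);
	if (requested != 0 && usable == 0)
		return -EINVAL;
	*out = usable;
	return 0;
}
```

With nodes 0, 1, 2 and 4 holding memory, a request for nodes 1-3 would be accepted and reduced to nodes 1-2, while a request for node 3 alone would be rejected.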


^ permalink raw reply	[flat|nested] 48+ messages in thread

Thread overview: 48+ messages
     [not found] <20070711182219.234782227@sgi.com>
     [not found] ` <20070711182252.138829364@sgi.com>
2007-07-11 18:46   ` [patch 10/12] Memoryless nodes: Update memory policy and page migration Nishanth Aravamudan
2007-07-11 18:56     ` Christoph Lameter
     [not found] ` <20070711182252.376540447@sgi.com>
2007-07-11 19:04   ` [patch 11/12] Add N_CPU node state Christoph Lameter
     [not found] ` <20070711182250.005856256@sgi.com>
2007-07-11 19:06   ` [patch 01/12] NUMA: Generic management of nodemasks for various purposes Christoph Lameter
2007-07-11 19:32     ` Lee Schermerhorn
2007-07-20 20:49     ` [PATCH] Memoryless nodes: use "node_memory_map" for cpuset mems_allowed validation Lee Schermerhorn
2007-07-20 22:07       ` Nishanth Aravamudan
2007-07-23 19:09       ` Nishanth Aravamudan
2007-07-23 19:23         ` Paul Jackson
2007-07-23 20:08           ` Nishanth Aravamudan
2007-07-23 20:59         ` Lee Schermerhorn
2007-07-23 21:48           ` Nishanth Aravamudan
2007-07-24 14:11             ` Lee Schermerhorn
2007-07-24 16:16               ` Nishanth Aravamudan
2007-07-24 14:15     ` [PATCH take2] " Lee Schermerhorn
2007-07-24 16:19       ` Nishanth Aravamudan
2007-07-24 19:01         ` Lee Schermerhorn
2007-07-25 15:50           ` Nishanth Aravamudan
2007-07-24 20:30     ` [PATCH take3] " Lee Schermerhorn
2007-07-25 15:53       ` Nishanth Aravamudan
2007-07-25 22:00       ` Nishanth Aravamudan
2007-07-26 13:04         ` Lee Schermerhorn
2007-07-27  0:40       ` Nishanth Aravamudan
2007-07-27 14:15         ` Lee Schermerhorn
2007-07-24 20:35     ` [PATCH/RFC] Memoryless nodes: Suppress redundant "node with no memory" messages Lee Schermerhorn
2007-07-25 15:56       ` Nishanth Aravamudan
     [not found] ` <20070711182251.433134748@sgi.com>
2007-07-12  0:07   ` [patch 07/12] Memoryless nodes: SLUB support Andrew Morton
2007-07-12  1:42     ` Christoph Lameter
2007-07-12 18:33       ` Nishanth Aravamudan
2007-07-12 18:38         ` Christoph Lameter
2007-07-13 15:14 ` [patch 00/12] NUMA: Memoryless node support V3 Nishanth Aravamudan
2007-07-13 16:43   ` Christoph Lameter
2007-07-13 16:52     ` Nishanth Aravamudan
2007-07-13 17:20     ` Lee Schermerhorn
2007-07-13 17:23       ` Christoph Lameter
2007-07-13 19:22         ` Lee Schermerhorn
2007-07-13 20:53         ` Lee Schermerhorn
2007-07-13 21:34           ` Christoph Lameter
2007-07-13 23:18           ` Nishanth Aravamudan
     [not found]     ` <1185310277.5649.90.camel@localhost>
     [not found]       ` <Pine.LNX.4.64.0707241402010.4773@schroedinger.engr.sgi.com>
     [not found]         ` <1185372692.5604.22.camel@localhost>
2007-07-25 15:45           ` Lee Schermerhorn
2007-07-25 19:16             ` 2.6.23-rc1-mm1: boot hang on ia64 with memoryless nodes Lee Schermerhorn
2007-07-25 19:38               ` Christoph Lameter
2007-07-25 20:03                 ` Christoph Lameter
2007-07-25 21:18                 ` Lee Schermerhorn
2007-07-26 13:53                   ` Lee Schermerhorn
2007-07-26 14:00                     ` KAMEZAWA Hiroyuki
2007-07-26 18:10                       ` Lee Schermerhorn
2007-07-26 14:33                     ` Lee Schermerhorn
