* [PATCH 00/14] NUMA: Memoryless node support V4
@ 2007-07-27 19:43 Lee Schermerhorn
2007-07-27 19:43 ` [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes Lee Schermerhorn
` (16 more replies)
0 siblings, 17 replies; 68+ messages in thread
From: Lee Schermerhorn @ 2007-07-27 19:43 UTC (permalink / raw)
To: linux-mm
Cc: ak, Lee Schermerhorn, Nishanth Aravamudan, pj, kxr,
Christoph Lameter, Mel Gorman, akpm, KAMEZAWA Hiroyuki
Changes V3->V4:
- Refresh against 23-rc1-mm1
- teach cpusets about memoryless nodes.
Changes V2->V3:
- Refresh patches (sigh)
- Add comments suggested by Kamezawa Hiroyuki
- Add signoff by Jes Sorensen
Changes V1->V2:
- Add a generic layer that allows the definition of additional node bitmaps
This patchset implements additional node bitmaps that allow the system
to track nodes that are online without memory and nodes that have processors.
Various subsystems can use that information to customize VM behavior.
We define a number of node states that we track in enum node_states:
/*
* Bitmasks that are kept for all the nodes.
*/
enum node_states {
N_POSSIBLE, /* The node could become online at some point */
N_ONLINE, /* The node is online */
N_MEMORY, /* The node has memory */
N_CPU, /* The node has cpus */
NR_NODE_STATES
};
and define operations using the node states:
static inline int node_state(int node, enum node_states state)
{
return node_isset(node, node_states[state]);
}
static inline void node_set_state(int node, enum node_states state)
{
__node_set(node, &node_states[state]);
}
static inline void node_clear_state(int node, enum node_states state)
{
__node_clear(node, &node_states[state]);
}
static inline int num_node_state(enum node_states state)
{
return nodes_weight(node_states[state]);
}
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>
* [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes
2007-07-27 19:43 [PATCH 00/14] NUMA: Memoryless node support V4 Lee Schermerhorn
@ 2007-07-27 19:43 ` Lee Schermerhorn
2007-07-30 21:38 ` [PATCH/RFC] 2.6.23-rc1-mm1: MPOL_PREFERRED fixups for preferred_node < 0 Lee Schermerhorn
2007-08-01 2:22 ` [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes Andrew Morton
2007-07-27 19:43 ` [PATCH 02/14] Memoryless nodes: introduce mask of nodes with memory Lee Schermerhorn
` (15 subsequent siblings)
16 siblings, 2 replies; 68+ messages in thread
From: Lee Schermerhorn @ 2007-07-27 19:43 UTC (permalink / raw)
To: linux-mm
Cc: ak, Lee Schermerhorn, Nishanth Aravamudan, pj, kxr,
Christoph Lameter, Mel Gorman, akpm, KAMEZAWA Hiroyuki
[patch 1/14] NUMA: Generic management of nodemasks for various purposes
Preparation for memoryless node patches.
Provide a generic way to keep nodemasks describing various characteristics
of NUMA nodes.
Remove the node_online_map and the node_possible_map and realize the whole
thing using two node states: N_POSSIBLE and N_ONLINE.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Bob Picco <bob.picco@hp.com>
include/linux/nodemask.h | 87 ++++++++++++++++++++++++++++++++++++++---------
mm/page_alloc.c | 13 +++----
2 files changed, 78 insertions(+), 22 deletions(-)
Index: Linux/include/linux/nodemask.h
===================================================================
--- Linux.orig/include/linux/nodemask.h 2007-07-08 19:32:17.000000000 -0400
+++ Linux/include/linux/nodemask.h 2007-07-25 11:36:25.000000000 -0400
@@ -338,31 +338,81 @@ static inline void __nodes_remap(nodemas
#endif /* MAX_NUMNODES */
/*
+ * Bitmasks that are kept for all the nodes.
+ */
+enum node_states {
+ N_POSSIBLE, /* The node could become online at some point */
+ N_ONLINE, /* The node is online */
+ NR_NODE_STATES
+};
+
+/*
* The following particular system nodemasks and operations
* on them manage all possible and online nodes.
*/
-extern nodemask_t node_online_map;
-extern nodemask_t node_possible_map;
+extern nodemask_t node_states[NR_NODE_STATES];
#if MAX_NUMNODES > 1
-#define num_online_nodes() nodes_weight(node_online_map)
-#define num_possible_nodes() nodes_weight(node_possible_map)
-#define node_online(node) node_isset((node), node_online_map)
-#define node_possible(node) node_isset((node), node_possible_map)
-#define first_online_node first_node(node_online_map)
-#define next_online_node(nid) next_node((nid), node_online_map)
+static inline int node_state(int node, enum node_states state)
+{
+ return node_isset(node, node_states[state]);
+}
+
+static inline void node_set_state(int node, enum node_states state)
+{
+ __node_set(node, &node_states[state]);
+}
+
+static inline void node_clear_state(int node, enum node_states state)
+{
+ __node_clear(node, &node_states[state]);
+}
+
+static inline int num_node_state(enum node_states state)
+{
+ return nodes_weight(node_states[state]);
+}
+
+#define for_each_node_state(__node, __state) \
+ for_each_node_mask((__node), node_states[__state])
+
+#define first_online_node first_node(node_states[N_ONLINE])
+#define next_online_node(nid) next_node((nid), node_states[N_ONLINE])
+
extern int nr_node_ids;
#else
-#define num_online_nodes() 1
-#define num_possible_nodes() 1
-#define node_online(node) ((node) == 0)
-#define node_possible(node) ((node) == 0)
+
+static inline int node_state(int node, enum node_states state)
+{
+ return node == 0;
+}
+
+static inline void node_set_state(int node, enum node_states state)
+{
+}
+
+static inline void node_clear_state(int node, enum node_states state)
+{
+}
+
+static inline int num_node_state(enum node_states state)
+{
+ return 1;
+}
+
+#define for_each_node_state(node, __state) \
+ for ( (node) = 0; (node) == 0; (node) = 1)
+
#define first_online_node 0
#define next_online_node(nid) (MAX_NUMNODES)
#define nr_node_ids 1
+
#endif
+#define node_online_map node_states[N_ONLINE]
+#define node_possible_map node_states[N_POSSIBLE]
+
#define any_online_node(mask) \
({ \
int node; \
@@ -372,10 +422,15 @@ extern int nr_node_ids;
node; \
})
-#define node_set_online(node) set_bit((node), node_online_map.bits)
-#define node_set_offline(node) clear_bit((node), node_online_map.bits)
+#define num_online_nodes() num_node_state(N_ONLINE)
+#define num_possible_nodes() num_node_state(N_POSSIBLE)
+#define node_online(node) node_state((node), N_ONLINE)
+#define node_possible(node) node_state((node), N_POSSIBLE)
+
+#define node_set_online(node) node_set_state((node), N_ONLINE)
+#define node_set_offline(node) node_clear_state((node), N_ONLINE)
-#define for_each_node(node) for_each_node_mask((node), node_possible_map)
-#define for_each_online_node(node) for_each_node_mask((node), node_online_map)
+#define for_each_node(node) for_each_node_state(node, N_POSSIBLE)
+#define for_each_online_node(node) for_each_node_state(node, N_ONLINE)
#endif /* __LINUX_NODEMASK_H */
Index: Linux/mm/page_alloc.c
===================================================================
--- Linux.orig/mm/page_alloc.c 2007-07-25 09:29:50.000000000 -0400
+++ Linux/mm/page_alloc.c 2007-07-25 11:36:25.000000000 -0400
@@ -48,13 +48,14 @@
#include "internal.h"
/*
- * MCD - HACK: Find somewhere to initialize this EARLY, or make this
- * initializer cleaner
+ * Array of node states.
*/
-nodemask_t node_online_map __read_mostly = { { [0] = 1UL } };
-EXPORT_SYMBOL(node_online_map);
-nodemask_t node_possible_map __read_mostly = NODE_MASK_ALL;
-EXPORT_SYMBOL(node_possible_map);
+nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
+ [N_POSSIBLE] = NODE_MASK_ALL,
+ [N_ONLINE] = { { [0] = 1UL } }
+};
+EXPORT_SYMBOL(node_states);
+
unsigned long totalram_pages __read_mostly;
unsigned long totalreserve_pages __read_mostly;
long nr_swap_pages;
--
* [PATCH 02/14] Memoryless nodes: introduce mask of nodes with memory
2007-07-27 19:43 [PATCH 00/14] NUMA: Memoryless node support V4 Lee Schermerhorn
2007-07-27 19:43 ` [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes Lee Schermerhorn
@ 2007-07-27 19:43 ` Lee Schermerhorn
2007-07-27 19:43 ` [PATCH 03/14] Memoryless Nodes: Fix interleave behavior Lee Schermerhorn
` (14 subsequent siblings)
16 siblings, 0 replies; 68+ messages in thread
From: Lee Schermerhorn @ 2007-07-27 19:43 UTC (permalink / raw)
To: linux-mm
Cc: ak, Lee Schermerhorn, Nishanth Aravamudan, pj, kxr,
Christoph Lameter, Mel Gorman, akpm, KAMEZAWA Hiroyuki
[patch 2/14] Memoryless nodes: introduce mask of nodes with memory
It is necessary to know if nodes have memory since we have recently
begun to add support for memoryless nodes. For that purpose we introduce
a new node state N_MEMORY.
A node has its bit in node_states[N_MEMORY] set if it has memory, i.e.
if it has at least one zone with present pages defined in the zone array
embedded in its pgdat structure.
N_MEMORY can then be used in various places to ensure that we
do the right thing when we encounter a memoryless node.
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Tested-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Acked-by: Bob Picco <bob.picco@hp.com>
include/linux/nodemask.h | 1 +
mm/page_alloc.c | 9 +++++++--
2 files changed, 8 insertions(+), 2 deletions(-)
Index: Linux/include/linux/nodemask.h
===================================================================
--- Linux.orig/include/linux/nodemask.h 2007-07-25 11:36:25.000000000 -0400
+++ Linux/include/linux/nodemask.h 2007-07-25 11:36:27.000000000 -0400
@@ -343,6 +343,7 @@ static inline void __nodes_remap(nodemas
enum node_states {
N_POSSIBLE, /* The node could become online at some point */
N_ONLINE, /* The node is online */
+ N_MEMORY, /* The node has memory */
NR_NODE_STATES
};
Index: Linux/mm/page_alloc.c
===================================================================
--- Linux.orig/mm/page_alloc.c 2007-07-25 11:36:25.000000000 -0400
+++ Linux/mm/page_alloc.c 2007-07-25 11:36:27.000000000 -0400
@@ -2387,8 +2387,13 @@ static int __build_all_zonelists(void *d
int nid;
for_each_online_node(nid) {
- build_zonelists(NODE_DATA(nid));
- build_zonelist_cache(NODE_DATA(nid));
+ pg_data_t *pgdat = NODE_DATA(nid);
+
+ build_zonelists(pgdat);
+ build_zonelist_cache(pgdat);
+
+ if (pgdat->node_present_pages)
+ node_set_state(nid, N_MEMORY);
}
return 0;
}
--
* [PATCH 03/14] Memoryless Nodes: Fix interleave behavior
2007-07-27 19:43 [PATCH 00/14] NUMA: Memoryless node support V4 Lee Schermerhorn
2007-07-27 19:43 ` [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes Lee Schermerhorn
2007-07-27 19:43 ` [PATCH 02/14] Memoryless nodes: introduce mask of nodes with memory Lee Schermerhorn
@ 2007-07-27 19:43 ` Lee Schermerhorn
2007-07-27 19:43 ` [PATCH 04/14] OOM: use the N_MEMORY map instead of constructing one on the fly Lee Schermerhorn
` (13 subsequent siblings)
16 siblings, 0 replies; 68+ messages in thread
From: Lee Schermerhorn @ 2007-07-27 19:43 UTC (permalink / raw)
To: linux-mm
Cc: ak, Lee Schermerhorn, Nishanth Aravamudan, pj, kxr,
Christoph Lameter, Mel Gorman, akpm, KAMEZAWA Hiroyuki
[patch 3/14] Memoryless Nodes: Fix interleave behavior for memoryless nodes
MPOL_INTERLEAVE currently simply loops over all nodes. Allocations on
memoryless nodes will be redirected to nodes with memory. This results in
an imbalance because the neighbors of memoryless nodes will get significantly
more interleave hits than the rest of the nodes on the system.
We can avoid this imbalance by clearing the nodes in the interleave node
set that have no memory. If we use the node map of the memory nodes
instead of the online nodes then we have only the nodes we want.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Bob Picco <bob.picco@hp.com>
mm/mempolicy.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2007-07-25 09:29:50.000000000 -0400
+++ Linux/mm/mempolicy.c 2007-07-25 11:36:30.000000000 -0400
@@ -183,7 +183,9 @@ static struct mempolicy *mpol_new(int mo
switch (mode) {
case MPOL_INTERLEAVE:
policy->v.nodes = *nodes;
- if (nodes_weight(*nodes) == 0) {
+ nodes_and(policy->v.nodes, policy->v.nodes,
+ node_states[N_MEMORY]);
+ if (nodes_weight(policy->v.nodes) == 0) {
kmem_cache_free(policy_cache, policy);
return ERR_PTR(-EINVAL);
}
--
* [PATCH 04/14] OOM: use the N_MEMORY map instead of constructing one on the fly
2007-07-27 19:43 [PATCH 00/14] NUMA: Memoryless node support V4 Lee Schermerhorn
` (2 preceding siblings ...)
2007-07-27 19:43 ` [PATCH 03/14] Memoryless Nodes: Fix interleave behavior Lee Schermerhorn
@ 2007-07-27 19:43 ` Lee Schermerhorn
2007-07-27 19:43 ` [PATCH 05/14] Memoryless Nodes: No need for kswapd Lee Schermerhorn
` (12 subsequent siblings)
16 siblings, 0 replies; 68+ messages in thread
From: Lee Schermerhorn @ 2007-07-27 19:43 UTC (permalink / raw)
To: linux-mm
Cc: ak, Lee Schermerhorn, Nishanth Aravamudan, pj, kxr,
Christoph Lameter, Mel Gorman, akpm, KAMEZAWA Hiroyuki
[patch 04/14] OOM: use the N_MEMORY map instead of constructing one on the fly
constrained_alloc() builds its own memory map for nodes with memory.
We have that available in node_states[N_MEMORY] now. So simplify the code.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Bob Picco <bob.picco@hp.com>
mm/oom_kill.c | 9 +--------
1 file changed, 1 insertion(+), 8 deletions(-)
Index: Linux/mm/oom_kill.c
===================================================================
--- Linux.orig/mm/oom_kill.c 2007-07-26 12:40:17.000000000 -0400
+++ Linux/mm/oom_kill.c 2007-07-27 08:59:31.000000000 -0400
@@ -176,14 +176,7 @@ static inline int constrained_alloc(stru
{
#ifdef CONFIG_NUMA
struct zone **z;
- nodemask_t nodes;
- int node;
-
- nodes_clear(nodes);
- /* node has memory ? */
- for_each_online_node(node)
- if (NODE_DATA(node)->node_present_pages)
- node_set(node, nodes);
+ nodemask_t nodes = node_states[N_MEMORY];
for (z = zonelist->zones; *z; z++)
if (cpuset_zone_allowed_softwall(*z, gfp_mask))
--
* [PATCH 05/14] Memoryless Nodes: No need for kswapd
2007-07-27 19:43 [PATCH 00/14] NUMA: Memoryless node support V4 Lee Schermerhorn
` (3 preceding siblings ...)
2007-07-27 19:43 ` [PATCH 04/14] OOM: use the N_MEMORY map instead of constructing one on the fly Lee Schermerhorn
@ 2007-07-27 19:43 ` Lee Schermerhorn
2007-07-27 19:43 ` [PATCH 06/14] Memoryless Node: Slab support Lee Schermerhorn
` (11 subsequent siblings)
16 siblings, 0 replies; 68+ messages in thread
From: Lee Schermerhorn @ 2007-07-27 19:43 UTC (permalink / raw)
To: linux-mm
Cc: ak, Lee Schermerhorn, Nishanth Aravamudan, pj, kxr,
Christoph Lameter, Mel Gorman, akpm, KAMEZAWA Hiroyuki
[patch 05/14] Memoryless Nodes: No need for kswapd
A node without memory does not need a kswapd. So use the memory map instead
of the online map when starting kswapd.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Bob Picco <bob.picco@hp.com>
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: Linux/mm/vmscan.c
===================================================================
--- Linux.orig/mm/vmscan.c 2007-07-25 09:29:50.000000000 -0400
+++ Linux/mm/vmscan.c 2007-07-25 11:36:35.000000000 -0400
@@ -1716,7 +1716,7 @@ static int __init kswapd_init(void)
int nid;
swap_setup();
- for_each_online_node(nid)
+ for_each_node_state(nid, N_MEMORY)
kswapd_run(nid);
hotcpu_notifier(cpu_callback, 0);
return 0;
--
* [PATCH 06/14] Memoryless Node: Slab support
2007-07-27 19:43 [PATCH 00/14] NUMA: Memoryless node support V4 Lee Schermerhorn
` (4 preceding siblings ...)
2007-07-27 19:43 ` [PATCH 05/14] Memoryless Nodes: No need for kswapd Lee Schermerhorn
@ 2007-07-27 19:43 ` Lee Schermerhorn
2007-07-27 19:44 ` [PATCH 07/14] Memoryless nodes: SLUB support Lee Schermerhorn
` (10 subsequent siblings)
16 siblings, 0 replies; 68+ messages in thread
From: Lee Schermerhorn @ 2007-07-27 19:43 UTC (permalink / raw)
To: linux-mm
Cc: ak, Lee Schermerhorn, Nishanth Aravamudan, pj, kxr,
Christoph Lameter, Mel Gorman, akpm, KAMEZAWA Hiroyuki
[patch 06/14] Memoryless Node: Slab support
Slab should not allocate control structures for nodes without memory.
This may seem to work right now but it's unreliable since not all
allocations can fall back due to the use of GFP_THISNODE.
Switching a few for_each_online_node's to for_each_node_state(node, N_MEMORY)
will allow us to only allocate for nodes that actually have memory.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Bob Picco <bob.picco@hp.com>
mm/slab.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
Index: Linux/mm/slab.c
===================================================================
--- Linux.orig/mm/slab.c 2007-07-25 09:29:50.000000000 -0400
+++ Linux/mm/slab.c 2007-07-25 11:36:37.000000000 -0400
@@ -1565,7 +1565,7 @@ void __init kmem_cache_init(void)
/* Replace the static kmem_list3 structures for the boot cpu */
init_list(&cache_cache, &initkmem_list3[CACHE_CACHE], node);
- for_each_online_node(nid) {
+ for_each_node_state(nid, N_MEMORY) {
init_list(malloc_sizes[INDEX_AC].cs_cachep,
&initkmem_list3[SIZE_AC + nid], nid);
@@ -1943,7 +1943,7 @@ static void __init set_up_list3s(struct
{
int node;
- for_each_online_node(node) {
+ for_each_node_state(node, N_MEMORY) {
cachep->nodelists[node] = &initkmem_list3[index + node];
cachep->nodelists[node]->next_reap = jiffies +
REAPTIMEOUT_LIST3 +
@@ -2074,7 +2074,7 @@ static int __init_refok setup_cpu_cache(
g_cpucache_up = PARTIAL_L3;
} else {
int node;
- for_each_online_node(node) {
+ for_each_node_state(node, N_MEMORY) {
cachep->nodelists[node] =
kmalloc_node(sizeof(struct kmem_list3),
GFP_KERNEL, node);
@@ -3784,7 +3784,7 @@ static int alloc_kmemlist(struct kmem_ca
struct array_cache *new_shared;
struct array_cache **new_alien = NULL;
- for_each_online_node(node) {
+ for_each_node_state(node, N_MEMORY) {
if (use_alien_caches) {
new_alien = alloc_alien_cache(node, cachep->limit);
--
* [PATCH 07/14] Memoryless nodes: SLUB support
2007-07-27 19:43 [PATCH 00/14] NUMA: Memoryless node support V4 Lee Schermerhorn
` (5 preceding siblings ...)
2007-07-27 19:43 ` [PATCH 06/14] Memoryless Node: Slab support Lee Schermerhorn
@ 2007-07-27 19:44 ` Lee Schermerhorn
2007-07-27 19:44 ` [PATCH 08/14] Uncached allocator: Handle memoryless nodes Lee Schermerhorn
` (9 subsequent siblings)
16 siblings, 0 replies; 68+ messages in thread
From: Lee Schermerhorn @ 2007-07-27 19:44 UTC (permalink / raw)
To: linux-mm
Cc: ak, Lee Schermerhorn, Nishanth Aravamudan, pj, kxr,
Christoph Lameter, Mel Gorman, akpm, KAMEZAWA Hiroyuki
[patch 07/14] Memoryless nodes: SLUB support
Simply switch all for_each_online_node to for_each_node_state(node, N_MEMORY).
That way SLUB only operates on nodes with memory. Any allocation attempt on a
memoryless node will fail, whereupon SLUB will fetch memory from a nearby
node (depending on how memory policies and cpusets describe fallback).
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Bob Picco <bob.picco@hp.com>
mm/slub.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
Index: Linux/mm/slub.c
===================================================================
--- Linux.orig/mm/slub.c 2007-07-25 09:29:50.000000000 -0400
+++ Linux/mm/slub.c 2007-07-25 11:37:28.000000000 -0400
@@ -1918,7 +1918,7 @@ static void free_kmem_cache_nodes(struct
{
int node;
- for_each_online_node(node) {
+ for_each_node_state(node, N_MEMORY) {
struct kmem_cache_node *n = s->node[node];
if (n && n != &s->local_node)
kmem_cache_free(kmalloc_caches, n);
@@ -1936,7 +1936,7 @@ static int init_kmem_cache_nodes(struct
else
local_node = 0;
- for_each_online_node(node) {
+ for_each_node_state(node, N_MEMORY) {
struct kmem_cache_node *n;
if (local_node == node)
@@ -2189,7 +2189,7 @@ static inline int kmem_cache_close(struc
flush_all(s);
/* Attempt to free all objects */
- for_each_online_node(node) {
+ for_each_node_state(node, N_MEMORY) {
struct kmem_cache_node *n = get_node(s, node);
n->nr_partial -= free_list(s, n, &n->partial);
@@ -2484,7 +2484,7 @@ int kmem_cache_shrink(struct kmem_cache
return -ENOMEM;
flush_all(s);
- for_each_online_node(node) {
+ for_each_node_state(node, N_MEMORY) {
n = get_node(s, node);
if (!n->nr_partial)
@@ -2884,7 +2884,7 @@ static long validate_slab_cache(struct k
return -ENOMEM;
flush_all(s);
- for_each_online_node(node) {
+ for_each_node_state(node, N_MEMORY) {
struct kmem_cache_node *n = get_node(s, node);
count += validate_slab_node(s, n, map);
@@ -3104,7 +3104,7 @@ static int list_locations(struct kmem_ca
/* Push back cpu slabs */
flush_all(s);
- for_each_online_node(node) {
+ for_each_node_state(node, N_MEMORY) {
struct kmem_cache_node *n = get_node(s, node);
unsigned long flags;
struct page *page;
@@ -3231,7 +3231,7 @@ static unsigned long slab_objects(struct
}
}
- for_each_online_node(node) {
+ for_each_node_state(node, N_MEMORY) {
struct kmem_cache_node *n = get_node(s, node);
if (flags & SO_PARTIAL) {
@@ -3259,7 +3259,7 @@ static unsigned long slab_objects(struct
x = sprintf(buf, "%lu", total);
#ifdef CONFIG_NUMA
- for_each_online_node(node)
+ for_each_node_state(node, N_MEMORY)
if (nodes[node])
x += sprintf(buf + x, " N%d=%lu",
node, nodes[node]);
--
* [PATCH 08/14] Uncached allocator: Handle memoryless nodes
2007-07-27 19:43 [PATCH 00/14] NUMA: Memoryless node support V4 Lee Schermerhorn
` (6 preceding siblings ...)
2007-07-27 19:44 ` [PATCH 07/14] Memoryless nodes: SLUB support Lee Schermerhorn
@ 2007-07-27 19:44 ` Lee Schermerhorn
2007-07-27 19:44 ` [PATCH 09/14] Memoryless node: Allow profiling data to fall back to other nodes Lee Schermerhorn
` (8 subsequent siblings)
16 siblings, 0 replies; 68+ messages in thread
From: Lee Schermerhorn @ 2007-07-27 19:44 UTC (permalink / raw)
To: linux-mm
Cc: ak, Lee Schermerhorn, Nishanth Aravamudan, pj, kxr,
Christoph Lameter, Mel Gorman, akpm, KAMEZAWA Hiroyuki
[patch 08/14] Uncached allocator: Handle memoryless nodes
The checks for node_online in the uncached allocator are there to make sure
that memory is available on these nodes. Thus switch the check in the
allocation path to use node_state(nid, N_MEMORY).
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Jes Sorensen <jes@sgi.com>
Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Bob Picco <bob.picco@hp.com>
arch/ia64/kernel/uncached.c | 4 ++--
drivers/char/mspec.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
Index: Linux/arch/ia64/kernel/uncached.c
===================================================================
--- Linux.orig/arch/ia64/kernel/uncached.c 2007-07-08 19:32:17.000000000 -0400
+++ Linux/arch/ia64/kernel/uncached.c 2007-07-25 11:37:41.000000000 -0400
@@ -196,7 +196,7 @@ unsigned long uncached_alloc_page(int st
nid = starting_nid;
do {
- if (!node_online(nid))
+ if (!node_state(nid, N_MEMORY))
continue;
uc_pool = &uncached_pools[nid];
if (uc_pool->pool == NULL)
@@ -268,7 +268,7 @@ static int __init uncached_init(void)
{
int nid;
- for_each_online_node(nid) {
+ for_each_node_state(nid, N_ONLINE) {
uncached_pools[nid].pool = gen_pool_create(PAGE_SHIFT, nid);
mutex_init(&uncached_pools[nid].add_chunk_mutex);
}
Index: Linux/drivers/char/mspec.c
===================================================================
--- Linux.orig/drivers/char/mspec.c 2007-07-25 09:29:43.000000000 -0400
+++ Linux/drivers/char/mspec.c 2007-07-25 11:37:41.000000000 -0400
@@ -344,7 +344,7 @@ mspec_init(void)
is_sn2 = 1;
if (is_shub2()) {
ret = -ENOMEM;
- for_each_online_node(nid) {
+ for_each_node_state(nid, N_ONLINE) {
int actual_nid;
int nasid;
unsigned long phys;
--
* [PATCH 09/14] Memoryless node: Allow profiling data to fall back to other nodes
2007-07-27 19:43 [PATCH 00/14] NUMA: Memoryless node support V4 Lee Schermerhorn
` (7 preceding siblings ...)
2007-07-27 19:44 ` [PATCH 08/14] Uncached allocator: Handle memoryless nodes Lee Schermerhorn
@ 2007-07-27 19:44 ` Lee Schermerhorn
2007-07-27 19:44 ` [PATCH 10/14] Memoryless nodes: Update memory policy and page migration Lee Schermerhorn
` (7 subsequent siblings)
16 siblings, 0 replies; 68+ messages in thread
From: Lee Schermerhorn @ 2007-07-27 19:44 UTC (permalink / raw)
To: linux-mm
Cc: ak, Lee Schermerhorn, Nishanth Aravamudan, pj, kxr,
Christoph Lameter, Mel Gorman, akpm, KAMEZAWA Hiroyuki
[patch 09/14] Memoryless node: Allow profiling data to fall back to other nodes
Processors on memoryless nodes must be able to fall back to remote nodes
in order to get a profiling buffer. This may lead to excessive NUMA traffic
but I think we should allow this rather than failing.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Bob Picco <bob.picco@hp.com>
Index: linux-2.6.22-rc4-mm2/kernel/profile.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/kernel/profile.c 2007-06-13 23:36:42.000000000 -0700
+++ linux-2.6.22-rc4-mm2/kernel/profile.c 2007-06-13 23:36:55.000000000 -0700
@@ -346,7 +346,7 @@ static int __devinit profile_cpu_callbac
per_cpu(cpu_profile_flip, cpu) = 0;
if (!per_cpu(cpu_profile_hits, cpu)[1]) {
page = alloc_pages_node(node,
- GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
+ GFP_KERNEL | __GFP_ZERO,
0);
if (!page)
return NOTIFY_BAD;
@@ -354,7 +354,7 @@ static int __devinit profile_cpu_callbac
}
if (!per_cpu(cpu_profile_hits, cpu)[0]) {
page = alloc_pages_node(node,
- GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
+ GFP_KERNEL | __GFP_ZERO,
0);
if (!page)
goto out_free;
--
* [PATCH 10/14] Memoryless nodes: Update memory policy and page migration
2007-07-27 19:43 [PATCH 00/14] NUMA: Memoryless node support V4 Lee Schermerhorn
` (8 preceding siblings ...)
2007-07-27 19:44 ` [PATCH 09/14] Memoryless node: Allow profiling data to fall back to other nodes Lee Schermerhorn
@ 2007-07-27 19:44 ` Lee Schermerhorn
2007-07-27 19:44 ` [PATCH 11/14] Add N_CPU node state Lee Schermerhorn
` (6 subsequent siblings)
16 siblings, 0 replies; 68+ messages in thread
From: Lee Schermerhorn @ 2007-07-27 19:44 UTC (permalink / raw)
To: linux-mm
Cc: ak, Lee Schermerhorn, Nishanth Aravamudan, pj, kxr,
Christoph Lameter, Mel Gorman, akpm, KAMEZAWA Hiroyuki
[patch 10/14] Memoryless nodes: Update memory policy and page migration
Online nodes now may have no memory. The checks and initialization must
therefore be changed to no longer use the online functions.
This will correctly initialize the interleave on bootup to only target
nodes with memory and will make sys_move_pages return an error when a page
is to be moved to a memoryless node. Similarly we will get an error if
MPOL_BIND and MPOL_INTERLEAVE is used on a memoryless node.
These are somewhat new semantics. So far one could specify memoryless nodes
and we would maybe do the right thing and just ignore the node (or we'd do
something strange like with MPOL_INTERLEAVE). If we want to allow the
specification of memoryless nodes via memory policies then we need to keep
checking for online nodes.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Bob Picco <bob.picco@hp.com>
mm/mempolicy.c | 10 +++++-----
mm/migrate.c | 2 +-
2 files changed, 6 insertions(+), 6 deletions(-)
Index: Linux/mm/migrate.c
===================================================================
--- Linux.orig/mm/migrate.c 2007-07-25 11:36:22.000000000 -0400
+++ Linux/mm/migrate.c 2007-07-25 11:37:45.000000000 -0400
@@ -979,7 +979,7 @@ asmlinkage long sys_move_pages(pid_t pid
goto out;
err = -ENODEV;
- if (!node_online(node))
+ if (!node_state(node, N_MEMORY))
goto out;
err = -EACCES;
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2007-07-25 11:36:30.000000000 -0400
+++ Linux/mm/mempolicy.c 2007-07-25 11:37:45.000000000 -0400
@@ -494,9 +494,9 @@ static void get_zonemask(struct mempolic
*nodes = p->v.nodes;
break;
case MPOL_PREFERRED:
- /* or use current node instead of online map? */
+ /* or use current node instead of memory_map? */
if (p->v.preferred_node < 0)
- *nodes = node_online_map;
+ *nodes = node_states[N_MEMORY];
else
node_set(p->v.preferred_node, *nodes);
break;
@@ -1616,7 +1616,7 @@ void __init numa_policy_init(void)
* fall back to the largest node if they're all smaller.
*/
nodes_clear(interleave_nodes);
- for_each_online_node(nid) {
+ for_each_node_state(nid, N_MEMORY) {
unsigned long total_pages = node_present_pages(nid);
/* Preserve the largest node */
@@ -1896,7 +1896,7 @@ int show_numa_map(struct seq_file *m, vo
seq_printf(m, " huge");
} else {
check_pgd_range(vma, vma->vm_start, vma->vm_end,
- &node_online_map, MPOL_MF_STATS, md);
+ &node_states[N_MEMORY], MPOL_MF_STATS, md);
}
if (!md->pages)
@@ -1923,7 +1923,7 @@ int show_numa_map(struct seq_file *m, vo
if (md->writeback)
seq_printf(m," writeback=%lu", md->writeback);
- for_each_online_node(n)
+ for_each_node_state(n, N_MEMORY)
if (md->node[n])
seq_printf(m, " N%d=%lu", n, md->node[n]);
out:
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* [PATCH 11/14] Add N_CPU node state
2007-07-27 19:43 [PATCH 00/14] NUMA: Memoryless node support V4 Lee Schermerhorn
` (9 preceding siblings ...)
2007-07-27 19:44 ` [PATCH 10/14] Memoryless nodes: Update memory policy and page migration Lee Schermerhorn
@ 2007-07-27 19:44 ` Lee Schermerhorn
2007-07-27 19:44 ` [PATCH 12/14] Memoryless nodes: Fix GFP_THISNODE behavior Lee Schermerhorn
` (5 subsequent siblings)
16 siblings, 0 replies; 68+ messages in thread
From: Lee Schermerhorn @ 2007-07-27 19:44 UTC (permalink / raw)
To: linux-mm
Cc: ak, Lee Schermerhorn, Nishanth Aravamudan, pj, kxr,
Christoph Lameter, Mel Gorman, akpm, KAMEZAWA Hiroyuki
[patch 11/14] Add N_CPU node state
We need a check for whether a node has cpus in zone reclaim. Zone reclaim
will not allow reclaim from a remote zone if that zone's node has cpus,
since such a node is expected to reclaim for itself.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Bob Picco <bob.picco@hp.com>
include/linux/nodemask.h | 1 +
mm/page_alloc.c | 4 +++-
mm/vmscan.c | 4 +---
3 files changed, 5 insertions(+), 4 deletions(-)
Index: Linux/include/linux/nodemask.h
===================================================================
--- Linux.orig/include/linux/nodemask.h 2007-07-25 11:36:27.000000000 -0400
+++ Linux/include/linux/nodemask.h 2007-07-25 11:37:48.000000000 -0400
@@ -344,6 +344,7 @@ enum node_states {
N_POSSIBLE, /* The node could become online at some point */
N_ONLINE, /* The node is online */
N_MEMORY, /* The node has memory */
+ N_CPU, /* The node has cpus */
NR_NODE_STATES
};
Index: Linux/mm/vmscan.c
===================================================================
--- Linux.orig/mm/vmscan.c 2007-07-25 11:36:35.000000000 -0400
+++ Linux/mm/vmscan.c 2007-07-25 11:37:48.000000000 -0400
@@ -1836,7 +1836,6 @@ static int __zone_reclaim(struct zone *z
int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
{
- cpumask_t mask;
int node_id;
/*
@@ -1873,8 +1872,7 @@ int zone_reclaim(struct zone *zone, gfp_
* as wide as possible.
*/
node_id = zone_to_nid(zone);
- mask = node_to_cpumask(node_id);
- if (!cpus_empty(mask) && node_id != numa_node_id())
+ if (node_state(node_id, N_CPU) && node_id != numa_node_id())
return 0;
return __zone_reclaim(zone, gfp_mask, order);
}
Index: Linux/mm/page_alloc.c
===================================================================
--- Linux.orig/mm/page_alloc.c 2007-07-25 11:36:27.000000000 -0400
+++ Linux/mm/page_alloc.c 2007-07-25 11:37:48.000000000 -0400
@@ -2723,6 +2723,7 @@ static struct per_cpu_pageset boot_pages
static int __cpuinit process_zones(int cpu)
{
struct zone *zone, *dzone;
+ int node = cpu_to_node(cpu);
for_each_zone(zone) {
@@ -2730,7 +2731,7 @@ static int __cpuinit process_zones(int c
continue;
zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset),
- GFP_KERNEL, cpu_to_node(cpu));
+ GFP_KERNEL, node);
if (!zone_pcp(zone, cpu))
goto bad;
@@ -2741,6 +2742,7 @@ static int __cpuinit process_zones(int c
(zone->present_pages / percpu_pagelist_fraction));
}
+ node_set_state(node, N_CPU);
return 0;
bad:
for_each_zone(dzone) {
--
* [PATCH 12/14] Memoryless nodes: Fix GFP_THISNODE behavior
2007-07-27 19:43 [PATCH 00/14] NUMA: Memoryless node support V4 Lee Schermerhorn
` (10 preceding siblings ...)
2007-07-27 19:44 ` [PATCH 11/14] Add N_CPU node state Lee Schermerhorn
@ 2007-07-27 19:44 ` Lee Schermerhorn
2007-07-27 19:44 ` [PATCH 13/14] Memoryless Nodes: use "node_memory_map" for cpusets Lee Schermerhorn
` (4 subsequent siblings)
16 siblings, 0 replies; 68+ messages in thread
From: Lee Schermerhorn @ 2007-07-27 19:44 UTC (permalink / raw)
To: linux-mm
Cc: ak, Lee Schermerhorn, Nishanth Aravamudan, pj, kxr,
Christoph Lameter, Mel Gorman, akpm, KAMEZAWA Hiroyuki
[patch 12/14] Memoryless nodes: Fix GFP_THISNODE behavior
GFP_THISNODE checks that the zone selected is within the pgdat (node) of the
first zone of a zonelist. That only works if the node has memory. A
memoryless node will have its first zone on another pgdat (node).
GFP_THISNODE will then simply return memory from the first pgdat of the
zonelist. Thus it is returning memory from other nodes. GFP_THISNODE
should instead fail if there is no local memory on a node.
Add a new set of zonelists for each node that contain only the zones
belonging to the node itself, so that no fallback is possible.
Then modify gfp_zone() to pick the right zone index based on the presence
of __GFP_THISNODE.
Drop the existing GFP_THISNODE checks from the page allocator's hot path.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Bob Picco <bob.picco@hp.com>
include/linux/gfp.h | 16 +++++++++++-----
include/linux/mmzone.h | 14 +++++++++++++-
mm/page_alloc.c | 28 +++++++++++++++++++++++-----
3 files changed, 47 insertions(+), 11 deletions(-)
Index: Linux/include/linux/gfp.h
===================================================================
--- Linux.orig/include/linux/gfp.h 2007-07-25 09:29:50.000000000 -0400
+++ Linux/include/linux/gfp.h 2007-07-25 11:37:52.000000000 -0400
@@ -116,22 +116,28 @@ static inline int allocflags_to_migratet
static inline enum zone_type gfp_zone(gfp_t flags)
{
+ int base = 0;
+
+#ifdef CONFIG_NUMA
+ if (flags & __GFP_THISNODE)
+ base = MAX_NR_ZONES;
+#endif
#ifdef CONFIG_ZONE_DMA
if (flags & __GFP_DMA)
- return ZONE_DMA;
+ return base + ZONE_DMA;
#endif
#ifdef CONFIG_ZONE_DMA32
if (flags & __GFP_DMA32)
- return ZONE_DMA32;
+ return base + ZONE_DMA32;
#endif
if ((flags & (__GFP_HIGHMEM | __GFP_MOVABLE)) ==
(__GFP_HIGHMEM | __GFP_MOVABLE))
- return ZONE_MOVABLE;
+ return base + ZONE_MOVABLE;
#ifdef CONFIG_HIGHMEM
if (flags & __GFP_HIGHMEM)
- return ZONE_HIGHMEM;
+ return base + ZONE_HIGHMEM;
#endif
- return ZONE_NORMAL;
+ return base + ZONE_NORMAL;
}
static inline gfp_t set_migrateflags(gfp_t gfp, gfp_t migrate_flags)
Index: Linux/mm/page_alloc.c
===================================================================
--- Linux.orig/mm/page_alloc.c 2007-07-25 11:37:48.000000000 -0400
+++ Linux/mm/page_alloc.c 2007-07-25 11:37:52.000000000 -0400
@@ -1433,9 +1433,6 @@ zonelist_scan:
!zlc_zone_worth_trying(zonelist, z, allowednodes))
continue;
zone = *z;
- if (unlikely(NUMA_BUILD && (gfp_mask & __GFP_THISNODE) &&
- zone->zone_pgdat != zonelist->zones[0]->zone_pgdat))
- break;
if ((alloc_flags & ALLOC_CPUSET) &&
!cpuset_zone_allowed_softwall(zone, gfp_mask))
goto try_next_zone;
@@ -1560,7 +1557,10 @@ restart:
z = zonelist->zones; /* the list of zones suitable for gfp_mask */
if (unlikely(*z == NULL)) {
- WARN_ON_ONCE(1);
+ /*
+ * Happens if we have an empty zonelist as a result of
+ * GFP_THISNODE being used on a memoryless node
+ */
return NULL;
}
@@ -2159,6 +2159,22 @@ static void build_zonelists_in_node_orde
}
/*
+ * Build gfp_thisnode zonelists
+ */
+static void build_thisnode_zonelists(pg_data_t *pgdat)
+{
+ enum zone_type i;
+ int j;
+ struct zonelist *zonelist;
+
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ zonelist = pgdat->node_zonelists + MAX_NR_ZONES + i;
+ j = build_zonelists_node(pgdat, zonelist, 0, i);
+ zonelist->zones[j] = NULL;
+ }
+}
+
+/*
* Build zonelists ordered by zone and nodes within zones.
* This results in conserving DMA zone[s] until all Normal memory is
* exhausted, but results in overflowing to remote node while memory
@@ -2262,7 +2278,7 @@ static void build_zonelists(pg_data_t *p
int order = current_zonelist_order;
/* initialize zonelists */
- for (i = 0; i < MAX_NR_ZONES; i++) {
+ for (i = 0; i < MAX_ZONELISTS; i++) {
zonelist = pgdat->node_zonelists + i;
zonelist->zones[0] = NULL;
}
@@ -2307,6 +2323,8 @@ static void build_zonelists(pg_data_t *p
/* calculate node order -- i.e., DMA last! */
build_zonelists_in_zone_order(pgdat, j);
}
+
+ build_thisnode_zonelists(pgdat);
}
/* Construct the zonelist performance cache - see further mmzone.h */
Index: Linux/include/linux/mmzone.h
===================================================================
--- Linux.orig/include/linux/mmzone.h 2007-07-25 09:29:50.000000000 -0400
+++ Linux/include/linux/mmzone.h 2007-07-25 11:37:52.000000000 -0400
@@ -357,6 +357,17 @@ struct zone {
#define MAX_ZONES_PER_ZONELIST (MAX_NUMNODES * MAX_NR_ZONES)
#ifdef CONFIG_NUMA
+
+/*
+ * The NUMA zonelists are doubled because we need zonelists that restrict the
+ * allocations to a single node for GFP_THISNODE.
+ *
+ * [0 .. MAX_NR_ZONES -1] : Zonelists with fallback
+ * [MAX_NR_ZONES .. MAX_ZONELISTS -1] : No fallback (GFP_THISNODE)
+ */
+#define MAX_ZONELISTS (2 * MAX_NR_ZONES)
+
+
/*
* We cache key information from each zonelist for smaller cache
* footprint when scanning for free pages in get_page_from_freelist().
@@ -422,6 +433,7 @@ struct zonelist_cache {
unsigned long last_full_zap; /* when last zap'd (jiffies) */
};
#else
+#define MAX_ZONELISTS MAX_NR_ZONES
struct zonelist_cache;
#endif
@@ -470,7 +482,7 @@ extern struct page *mem_map;
struct bootmem_data;
typedef struct pglist_data {
struct zone node_zones[MAX_NR_ZONES];
- struct zonelist node_zonelists[MAX_NR_ZONES];
+ struct zonelist node_zonelists[MAX_ZONELISTS];
int nr_zones;
#ifdef CONFIG_FLAT_NODE_MEM_MAP
struct page *node_mem_map;
--
* [PATCH 13/14] Memoryless Nodes: use "node_memory_map" for cpusets
2007-07-27 19:43 [PATCH 00/14] NUMA: Memoryless node support V4 Lee Schermerhorn
` (11 preceding siblings ...)
2007-07-27 19:44 ` [PATCH 12/14] Memoryless nodes: Fix GFP_THISNODE behavior Lee Schermerhorn
@ 2007-07-27 19:44 ` Lee Schermerhorn
2007-07-27 19:44 ` [PATCH 14/14] Memoryless nodes: drop one memoryless node boot warning Lee Schermerhorn
` (3 subsequent siblings)
16 siblings, 0 replies; 68+ messages in thread
From: Lee Schermerhorn @ 2007-07-27 19:44 UTC (permalink / raw)
To: linux-mm
Cc: ak, Lee Schermerhorn, Nishanth Aravamudan, pj, kxr,
Christoph Lameter, Mel Gorman, akpm, KAMEZAWA Hiroyuki
[patch 13/14] Memoryless Nodes: use "node_memory_map" for cpusets - take 4
Against 2.6.22-rc1-mm1 atop Christoph Lameter's memoryless nodes
series
take 2:
+ replaced node_online_map in cpuset_current_mems_allowed()
with node_states[N_MEMORY]
+ replaced node_online_map in cpuset_init_smp() with
node_states[N_MEMORY]
take 3:
+ fix up comments and top level cpuset tracking of nodes
with memory [instead of on-line nodes]
take 4:
+ fix typo in !CPUSETS definition of cpuset_current_mems_allowed()
+ fix up Documentation/cpusets.txt to reflect these changes.
cpusets try to ensure that any node added to a cpuset's
mems_allowed is on-line; the assumption was that on-line nodes
contain memory. Thus, it has been possible to add memoryless
nodes to a cpuset and then add tasks to this cpuset. This
results in a continuous series of oom-kills and an apparent
system hang.
Change cpusets to use node_states[N_MEMORY] [a.k.a.
node_memory_map] in place of node_online_map when vetting
memories. Return an error if the admin attempts to write a
non-empty mems_allowed node mask containing only memoryless nodes.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Tested-by: Nishanth Aravamudan <nacc@us.ibm.com>
Tested on 4-node ppc64 with 2 memoryless nodes. Top cpuset
(and all subsequent ones) only allow nodes 0 and 1 (the
nodes with memory).
Documentation/cpusets.txt | 8 ++++---
include/linux/cpuset.h | 2 -
kernel/cpuset.c | 51 +++++++++++++++++++++++++++++-----------------
3 files changed, 39 insertions(+), 22 deletions(-)
Index: Linux/kernel/cpuset.c
===================================================================
--- Linux.orig/kernel/cpuset.c 2007-07-26 12:40:16.000000000 -0400
+++ Linux/kernel/cpuset.c 2007-07-26 12:55:29.000000000 -0400
@@ -307,26 +307,26 @@ static void guarantee_online_cpus(const
/*
* Return in *pmask the portion of a cpusets's mems_allowed that
- * are online. If none are online, walk up the cpuset hierarchy
- * until we find one that does have some online mems. If we get
- * all the way to the top and still haven't found any online mems,
- * return node_online_map.
+ * are online, with memory. If none are online with memory, walk
+ * up the cpuset hierarchy until we find one that does have some
+ * online mems. If we get all the way to the top and still haven't
+ * found any online mems, return node_states[N_MEMORY].
*
* One way or another, we guarantee to return some non-empty subset
- * of node_online_map.
+ * of node_states[N_MEMORY].
*
* Call with callback_mutex held.
*/
static void guarantee_online_mems(const struct cpuset *cs, nodemask_t *pmask)
{
- while (cs && !nodes_intersects(cs->mems_allowed, node_online_map))
+ while (cs && !nodes_intersects(cs->mems_allowed, node_states[N_MEMORY]))
cs = cs->parent;
if (cs)
- nodes_and(*pmask, cs->mems_allowed, node_online_map);
+ nodes_and(*pmask, cs->mems_allowed, node_states[N_MEMORY]);
else
- *pmask = node_online_map;
- BUG_ON(!nodes_intersects(*pmask, node_online_map));
+ *pmask = node_states[N_MEMORY];
+ BUG_ON(!nodes_intersects(*pmask, node_states[N_MEMORY]));
}
/**
@@ -597,7 +597,7 @@ static int update_nodemask(struct cpuset
int retval;
struct container_iter it;
- /* top_cpuset.mems_allowed tracks node_online_map; it's read-only */
+ /* top_cpuset.mems_allowed tracks node_states[N_MEMORY]; it's read-only */
if (cs == &top_cpuset)
return -EACCES;
@@ -614,8 +614,21 @@ static int update_nodemask(struct cpuset
retval = nodelist_parse(buf, trialcs.mems_allowed);
if (retval < 0)
goto done;
+ if (!nodes_intersects(trialcs.mems_allowed,
+ node_states[N_MEMORY])) {
+ /*
+ * error if only memoryless nodes specified.
+ */
+ retval = -ENOSPC;
+ goto done;
+ }
}
- nodes_and(trialcs.mems_allowed, trialcs.mems_allowed, node_online_map);
+ /*
+ * Exclude memoryless nodes. We know that trialcs.mems_allowed
+ * contains at least one node with memory.
+ */
+ nodes_and(trialcs.mems_allowed, trialcs.mems_allowed,
+ node_states[N_MEMORY]);
oldmem = cs->mems_allowed;
if (nodes_equal(oldmem, trialcs.mems_allowed)) {
retval = 0; /* Too easy - nothing to do */
@@ -1356,8 +1369,9 @@ static void guarantee_online_cpus_mems_i
/*
* The cpus_allowed and mems_allowed nodemasks in the top_cpuset track
- * cpu_online_map and node_online_map. Force the top cpuset to track
- * whats online after any CPU or memory node hotplug or unplug event.
+ * cpu_online_map and node_states[N_MEMORY]. Force the top cpuset to
+ * track what's online after any CPU or memory node hotplug or unplug
+ * event.
*
* To ensure that we don't remove a CPU or node from the top cpuset
* that is currently in use by a child cpuset (which would violate
@@ -1377,7 +1391,7 @@ static void common_cpu_mem_hotplug_unplu
guarantee_online_cpus_mems_in_subtree(&top_cpuset);
top_cpuset.cpus_allowed = cpu_online_map;
- top_cpuset.mems_allowed = node_online_map;
+ top_cpuset.mems_allowed = node_states[N_MEMORY];
mutex_unlock(&callback_mutex);
container_unlock();
@@ -1405,8 +1419,9 @@ static int cpuset_handle_cpuhp(struct no
#ifdef CONFIG_MEMORY_HOTPLUG
/*
- * Keep top_cpuset.mems_allowed tracking node_online_map.
- * Call this routine anytime after you change node_online_map.
+ * Keep top_cpuset.mems_allowed tracking node_states[N_MEMORY].
+ * Call this routine anytime after you change
+ * node_states[N_MEMORY].
* See also the previous routine cpuset_handle_cpuhp().
*/
@@ -1425,7 +1440,7 @@ void cpuset_track_online_nodes(void)
void __init cpuset_init_smp(void)
{
top_cpuset.cpus_allowed = cpu_online_map;
- top_cpuset.mems_allowed = node_online_map;
+ top_cpuset.mems_allowed = node_states[N_MEMORY];
hotcpu_notifier(cpuset_handle_cpuhp, 0);
}
@@ -1465,7 +1480,7 @@ void cpuset_init_current_mems_allowed(vo
*
* Description: Returns the nodemask_t mems_allowed of the cpuset
* attached to the specified @tsk. Guaranteed to return some non-empty
- * subset of node_online_map, even if this means going outside the
+ * subset of node_states[N_MEMORY], even if this means going outside the
* tasks cpuset.
**/
Index: Linux/include/linux/cpuset.h
===================================================================
--- Linux.orig/include/linux/cpuset.h 2007-07-26 12:40:16.000000000 -0400
+++ Linux/include/linux/cpuset.h 2007-07-26 12:55:30.000000000 -0400
@@ -92,7 +92,7 @@ static inline nodemask_t cpuset_mems_all
return node_possible_map;
}
-#define cpuset_current_mems_allowed (node_online_map)
+#define cpuset_current_mems_allowed (node_states[N_MEMORY])
static inline void cpuset_init_current_mems_allowed(void) {}
static inline void cpuset_update_task_memory_state(void) {}
#define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
Index: Linux/Documentation/cpusets.txt
===================================================================
--- Linux.orig/Documentation/cpusets.txt 2007-07-25 09:29:48.000000000 -0400
+++ Linux/Documentation/cpusets.txt 2007-07-26 13:02:00.000000000 -0400
@@ -8,6 +8,7 @@ Portions Copyright (c) 2004-2006 Silicon
Modified by Paul Jackson <pj@sgi.com>
Modified by Christoph Lameter <clameter@sgi.com>
Modified by Paul Menage <menage@google.com>
+Modified by Lee Schermerhorn <lee.schermerhorn@hp.com>
CONTENTS:
=========
@@ -35,7 +36,8 @@ CONTENTS:
----------------------
Cpusets provide a mechanism for assigning a set of CPUs and Memory
-Nodes to a set of tasks.
+Nodes to a set of tasks. In this document "Memory Node" refers to
+an on-line node that contains memory.
Cpusets constrain the CPU and Memory placement of tasks to only
the resources within a tasks current cpuset. They form a nested
@@ -207,8 +209,8 @@ and name space for cpusets, with a minim
The cpus and mems files in the root (top_cpuset) cpuset are
read-only. The cpus file automatically tracks the value of
cpu_online_map using a CPU hotplug notifier, and the mems file
-automatically tracks the value of node_online_map using the
-cpuset_track_online_nodes() hook.
+automatically tracks the value of node_states[N_MEMORY]--i.e.,
+nodes with memory--using the cpuset_track_online_nodes() hook.
1.4 What are exclusive cpusets ?
--
* [PATCH 14/14] Memoryless nodes: drop one memoryless node boot warning
2007-07-27 19:43 [PATCH 00/14] NUMA: Memoryless node support V4 Lee Schermerhorn
` (12 preceding siblings ...)
2007-07-27 19:44 ` [PATCH 13/14] Memoryless Nodes: use "node_memory_map" for cpusets Lee Schermerhorn
@ 2007-07-27 19:44 ` Lee Schermerhorn
2007-07-27 20:59 ` [PATCH 00/14] NUMA: Memoryless node support V4 Nishanth Aravamudan
` (2 subsequent siblings)
16 siblings, 0 replies; 68+ messages in thread
From: Lee Schermerhorn @ 2007-07-27 19:44 UTC (permalink / raw)
To: linux-mm
Cc: ak, Lee Schermerhorn, Nishanth Aravamudan, pj, kxr,
Christoph Lameter, Mel Gorman, akpm, KAMEZAWA Hiroyuki
[patch 14/14] Memoryless nodes: drop one memoryless node boot warning
get_pfn_range_for_nid() is called multiple times for each node
at boot time. Each time, it will warn about nodes with no
memory, resulting in boot messages like:
Node 0 active with no memory
Node 0 active with no memory
Node 0 active with no memory
Node 0 active with no memory
Node 0 active with no memory
Node 0 active with no memory
On node 0 totalpages: 0
Node 0 active with no memory
Node 0 active with no memory
DMA zone: 0 pages used for memmap
Node 0 active with no memory
Node 0 active with no memory
Normal zone: 0 pages used for memmap
Node 0 active with no memory
Node 0 active with no memory
Movable zone: 0 pages used for memmap
and so on for each memoryless node.
We already have the "On node N totalpages: ..." and other
related messages, so drop the "Node N active with no memory"
warnings.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
mm/page_alloc.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
Index: Linux/mm/page_alloc.c
===================================================================
--- Linux.orig/mm/page_alloc.c 2007-07-26 12:34:15.000000000 -0400
+++ Linux/mm/page_alloc.c 2007-07-26 12:35:26.000000000 -0400
@@ -3097,10 +3097,8 @@ void __meminit get_pfn_range_for_nid(uns
*end_pfn = max(*end_pfn, early_node_map[i].end_pfn);
}
- if (*start_pfn == -1UL) {
- printk(KERN_WARNING "Node %u active with no memory\n", nid);
+ if (*start_pfn == -1UL)
*start_pfn = 0;
- }
/* Push the node boundaries out if requested */
account_node_boundary(nid, start_pfn, end_pfn);
--
* Re: [PATCH 00/14] NUMA: Memoryless node support V4
2007-07-27 19:43 [PATCH 00/14] NUMA: Memoryless node support V4 Lee Schermerhorn
` (13 preceding siblings ...)
2007-07-27 19:44 ` [PATCH 14/14] Memoryless nodes: drop one memoryless node boot warning Lee Schermerhorn
@ 2007-07-27 20:59 ` Nishanth Aravamudan
2007-07-30 13:48 ` Lee Schermerhorn
2007-07-29 12:35 ` Paul Jackson
2007-07-30 21:19 ` Nishanth Aravamudan
16 siblings, 1 reply; 68+ messages in thread
From: Nishanth Aravamudan @ 2007-07-27 20:59 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, ak, pj, kxr, Christoph Lameter, Mel Gorman, akpm,
KAMEZAWA Hiroyuki
On 27.07.2007 [15:43:16 -0400], Lee Schermerhorn wrote:
> Changes V3->V4:
> - Refresh against 23-rc1-mm1
> - teach cpusets about memoryless nodes.
>
> Changes V2->V3:
> - Refresh patches (sigh)
> - Add comments suggested by Kamezawa Hiroyuki
> - Add signoff by Jes Sorensen
>
> Changes V1->V2:
> - Add a generic layer that allows the definition of additional node bitmaps
Are you carrying this stack anywhere publicly? Like in a git tree or
even just big patch format?
Thanks,
Nish, who will rebase on top of this set
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [PATCH 00/14] NUMA: Memoryless node support V4
2007-07-27 19:43 [PATCH 00/14] NUMA: Memoryless node support V4 Lee Schermerhorn
` (14 preceding siblings ...)
2007-07-27 20:59 ` [PATCH 00/14] NUMA: Memoryless node support V4 Nishanth Aravamudan
@ 2007-07-29 12:35 ` Paul Jackson
2007-07-30 16:07 ` Lee Schermerhorn
2007-07-30 21:19 ` Nishanth Aravamudan
16 siblings, 1 reply; 68+ messages in thread
From: Paul Jackson @ 2007-07-29 12:35 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, ak, nacc, kxr, clameter, mel, akpm, kamezawa.hiroyu
Lee,
What is the motivation for memoryless nodes? I'm not sure what I
mean by that question -- perhaps the answer involves describing a
piece of hardware, perhaps a somewhat hypothetical piece of hardware
if the real hardware is proprietary. But usually adding new mechanisms
to the kernel should involve explaining why it is needed.
In this case, it might further involve explaining why we need memoryless
nodes, as opposed to say a hack for the above (hypothetical?) hardware
in question that pretends that any CPUs on such memoryless nodes are on
the nearest memory equipped node -- and then entirely drops the idea of
memoryless nodes. Most likely you have good reason not to go this way.
Good chance even you've already explained this, and I missed it.
===
I have user level code that scans the 'cpu%d' entries below the
/sys/devices/system/node%d directories, and then inverts the resulting
<node, cpu> map, in order to provide, for any given cpu the nearest
node. This code is a simple form of node and cpu topology for user
code that wants to setup cpusets with cpus and nodes 'near' each other.
Could you post the results, from such a (possibly hypothetical) machine,
of the following two commands:
find /sys/devices/system/node* -name cpu[0-9]\*
ls /sys/devices/system/cpu
And if the 'ls' shows cpus that the 'find' doesn't show, then can you
recommend how user code should be written that would return, for any
specified cpu (even one on a memoryless node) the number of the
'nearest' node that does have memory (for some plausible definition,
your choice pretty much, of 'nearest')?
Granted, this is not a pressing issue ... not much chance that my user
code will be running on your (hypothetical?) hardware anytime soon,
unless there is some deal in the works I don't know about for hp to
buy sgi ;).
In short, how should user code find 'nearby' memory nodes for cpus that
are on memoryless nodes?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH 00/14] NUMA: Memoryless node support V4
2007-07-27 20:59 ` [PATCH 00/14] NUMA: Memoryless node support V4 Nishanth Aravamudan
@ 2007-07-30 13:48 ` Lee Schermerhorn
0 siblings, 0 replies; 68+ messages in thread
From: Lee Schermerhorn @ 2007-07-30 13:48 UTC (permalink / raw)
To: Nishanth Aravamudan
Cc: linux-mm, ak, pj, kxr, Christoph Lameter, Mel Gorman, akpm,
KAMEZAWA Hiroyuki
On Fri, 2007-07-27 at 13:59 -0700, Nishanth Aravamudan wrote:
> On 27.07.2007 [15:43:16 -0400], Lee Schermerhorn wrote:
> > Changes V3->V4:
> > - Refresh against 23-rc1-mm1
> > - teach cpusets about memoryless nodes.
> >
> > Changes V2->V3:
> > - Refresh patches (sigh)
> > - Add comments suggested by Kamezawa Hiroyuki
> > - Add signoff by Jes Sorensen
> >
> > Changes V1->V2:
> > - Add a generic layer that allows the definition of additional node bitmaps
>
> Are you carrying this stack anywhere publicly? Like in a git tree or
> even just big patch format?
Sorry. Christoph did ask me to do this, but I booked out of here on
Friday w/o doing so. Tarball now at:
http://free.linux.hp.com/~lts/Patches/MemlessNodes/
Lee
* Re: [PATCH 00/14] NUMA: Memoryless node support V4
2007-07-29 12:35 ` Paul Jackson
@ 2007-07-30 16:07 ` Lee Schermerhorn
2007-07-30 18:56 ` Paul Jackson
0 siblings, 1 reply; 68+ messages in thread
From: Lee Schermerhorn @ 2007-07-30 16:07 UTC (permalink / raw)
To: Paul Jackson
Cc: linux-mm, ak, nacc, kxr, clameter, mel, akpm, kamezawa.hiroyu
On Sun, 2007-07-29 at 05:35 -0700, Paul Jackson wrote:
> Lee,
>
> What is the motivation for memoryless nodes? I'm not sure what I
> mean by that question -- perhaps the answer involves describing a
> piece of hardware, perhaps a somewhat hypothetical piece of hardware
> if the real hardware is proprietary. But usually adding new mechanisms
> to the kernel should involve explaining why it is needed.
Hi, Paul.
My motivation for working on the memoryless nodes patches is to properly
support all configurations of our hardware. We can configure our
platforms with from 0% to 100% "cell local memory" [CLM]. We also call
0% CLM "fully interleaved", as it the hardware interleaves the memory on
a cache line granularity. Our AMD-based x86_64 platforms have a similar
feature, altho' it's "all or nothing" on these platforms. I believe the
Fujitsu ia64 platform supports a similar feature.
One could reasonably ask why we have this feature. My understanding is
that certain OSes supported on this hardware were not very "NUMA-aware"
when the hardware was released--Linux, included. Hardware interleaving
smoothed out the "hot spots" and made it possible to run reasonably well
on the platform. This did leave some performance "on the table", as
Linux has demonstrated in recent releases. Linux now performs better
for some workloads, like AIM7, in 100% CLM mode. This was not the case
a year or two ago.
A couple of other details for completeness: Like SGI platforms, on our
platforms, cell local memory shows up at some ridiculously high physical
address, altho' maybe not so ridiculous as the Altix ;-). Interleaved
memory shows up at physical address 0. I understand that the
architecture requires some memory at phys addr 0. For this reason, even
when we configure 100% CLM, we still get a "small" amount of interleaved
memory--512M on my 4-node test system.
I should also mention that when the HP-UX group runs the TPC-C benchmark
for reporting, they find that a mixture of cell local and interleaved
memory provides the best performance. I don't know the details of how
they lay out the benchmark on this config, but I need to find out for
Linux testing...
Anyway, in 0% CLM/fully-interleaved mode, our platform looks like this:
available: 5 nodes (0-4)
node 0 size: 0 MB
node 0 free: 0 MB
node 1 size: 0 MB
node 1 free: 0 MB
node 2 size: 0 MB
node 2 free: 0 MB
node 3 size: 0 MB
node 3 free: 0 MB
node 4 size: 8191 MB <= interleaved at phys addr 0
node 4 free: 105 MB <= was running a test...
If I configure for 100% CLM and boot with mem=16G [on a 32G platform], I
get:
available: 5 nodes (0-4)
node 0 size: 7600 MB
node 0 free: 6647 MB
node 1 size: 8127 MB
node 1 free: 7675 MB
node 2 size: 144 MB
node 2 free: 94 MB
node 3 size: 0 MB
node 3 free: 0 MB
node 4 size: 511 MB <= interleaved @ phys addr 0
node 4 free: 494 MB
both configs include memoryless nodes.
> In this case, it might further involve explaining why we need memoryless
> nodes, as opposed to say a hack for the above (hypothetical?) hardware
> in question that pretends that any CPUs on such memoryless nodes are on
> the nearest memory equipped node -- and then entirely drops the idea of
> memoryless nodes. Most likely you have good reason not to go this way.
> Good chance even you've already explained this, and I missed it.
No, I haven't explained it. Christoph posted the original memoryless
nodes patch set in response to prompting from Andrew. He considered
failure to support memoryless nodes a bug. The system "sort of" worked
because for most allocations, the zonelists allow the memoryless nodes
immediately "fall back" to a node with memory. There were a few corner
cases that Christoph's series address.
I believe that the x86_64 kernel works as you suggest in fully
interleaved mode. All memory shows up on node zero in the SRAT, and all
cpus are attached to this node.
For my part, given that our platforms can be configured in a couple of
ways, I would prefer that cpus not change their node association based
on the configuration. But, that's just me... I know one shouldn't make
any assumptions about cpu-to-node association. Rather, we have the
libnuma APIs to query this information. Still... why go there?
And then there's the fact that, on some platforms, ours included, not
all nodes with memory are equal. See my recent patch to allow selected
nodes to be excluded from interleave policy. I don't want to exclude
these nodes from cpusets to achieve this, because there are cases [like
the TPC-C benchmark mentioned above] where we want the application to be
able to use the funky, interleaved memory, but only when requested
explicitly. IMO, Christoph's generic nodemask mechanism makes it easy
to handle nodes with special characteristics--no memory, excluded from
interleave, ...--in a generic way.
>
> ===
>
> I have user level code that scans the 'cpu%d' entries below the
> /sys/devices/system/node%d directories, and then inverts the resulting
> <node, cpu> map, in order to provide, for any given cpu the nearest
> node. This code is a simple form of node and cpu topology for user
> code that wants to setup cpusets with cpus and nodes 'near' each other.
Sounds useful for an administrator partitioning the machine. I can see
why you might need it with the size of your systems ;-). And, for our
platform in fully interleaved mode--even tho' there is only one node
with memory to choose from. Is this part of the SGI ProPack?
>
> Could you post the results, from such a (possibly hypothetical) machine,
> of the following two commands:
>
> find /sys/devices/system/node* -name cpu[0-9]\*
> ls /sys/devices/system/cpu
>
> And if the 'ls' shows cpus that the 'find' doesn't show, then can you
> recommend how user code should be written that would return, for any
> specified cpu (even one on a memoryless node) the number of the
> 'nearest' node that does have memory (for some plausible definition,
> your choice pretty much, of 'nearest')?
I verified that I see all cpus [16 on the 4-node, 16 cpu ia64 platform
I'm testing on], either way: find or ls [w/ and w/o cell local
memory].
>
> Granted, this is not a pressing issue ... not much chance that my user
> code will be running on your (hypothetical?) hardware anytime soon,
> unless there is some deal in the works I don't know about for hp to
> buy sgi ;).
>
> In short, how should user code find 'nearby' memory nodes for cpus that
> are on memoryless nodes?
Again, on the fully interleaved config, there is only one node with
memory, so it's not hard. And in the 100% CLM, with mem=<less than 100%
of existing memory> [2nd config above], the SLIT says that the
interleaved pseudo-node is closer to any real node than any other real
node--based on the average latency. The interleaved node is always the
highest numbered node. Mileage may vary on other platforms...
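The SLIT-based answer above can be sketched in user code roughly as follows. This is a hypothetical illustration: it assumes the per-node `distance` files under /sys/devices/system/node/ (one SLIT row per node) and takes the set of memory-equipped nodes as an input rather than deciding it here:

```python
import glob
import os
import re

def read_distances(sysfs_root="/sys/devices/system/node"):
    """Parse each node's 'distance' file into {node: [dist, ...]} (one SLIT row)."""
    dist = {}
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*", "distance")):
        node = int(re.search(r"node(\d+)", path).group(1))
        with open(path) as f:
            dist[node] = [int(v) for v in f.read().split()]
    return dist

def nearest_memory_node(cpu_node, distances, nodes_with_memory):
    """Pick the memory-equipped node with the smallest SLIT distance from
    the given cpu's home node (ties broken by lowest node id)."""
    row = distances[cpu_node]
    return min(nodes_with_memory, key=lambda n: (row[n], n))
```

With a distance matrix shaped like the configs above--where the interleaved pseudo-node is closer to every real node than any other real node--a cpu on a memoryless node lands on the interleaved node, as described.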
Lee
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: [PATCH 00/14] NUMA: Memoryless node support V4
2007-07-30 16:07 ` Lee Schermerhorn
@ 2007-07-30 18:56 ` Paul Jackson
0 siblings, 0 replies; 68+ messages in thread
From: Paul Jackson @ 2007-07-30 18:56 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, ak, nacc, kxr, clameter, mel, akpm, kamezawa.hiroyu
Hmmm ... good explanation, Lee. Thanks.
I started to think about it more carefully,
but guess I'd better get back to my vacation
for this week.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH 00/14] NUMA: Memoryless node support V4
2007-07-27 19:43 [PATCH 00/14] NUMA: Memoryless node support V4 Lee Schermerhorn
` (15 preceding siblings ...)
2007-07-29 12:35 ` Paul Jackson
@ 2007-07-30 21:19 ` Nishanth Aravamudan
2007-07-30 22:06 ` Christoph Lameter
16 siblings, 1 reply; 68+ messages in thread
From: Nishanth Aravamudan @ 2007-07-30 21:19 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, ak, pj, kxr, Christoph Lameter, Mel Gorman, akpm,
KAMEZAWA Hiroyuki, apw
On 27.07.2007 [15:43:16 -0400], Lee Schermerhorn wrote:
> Changes V3->V4:
> - Refresh against 23-rc1-mm1
> - teach cpusets about memoryless nodes.
>
> Changes V2->V3:
> - Refresh patches (sigh)
> - Add comments suggested by Kamezawa Hiroyuki
> - Add signoff by Jes Sorensen
>
> Changes V1->V2:
> - Add a generic layer that allows the definition of additional node bitmaps
>
> This patchset is implementing additional node bitmaps that allow the system
> to track nodes that are online without memory and nodes that have processors.
Ok, submitted a bunch of jobs to just touch test this stack. Found two
issues:
On moe, a NUMA-Q box (part of test.kernel.org), I didn't see the same
panic that Andy reported; instead, I got:
------------[ cut here ]------------
kernel BUG at mm/slub.c:1895!
invalid opcode: 0000 [#1]
SMP
Modules linked in:
CPU: 0
EIP: 0060:[<c105d0a8>] Not tainted VLI
EFLAGS: 00010046 (2.6.23-rc1-mm1-autokern1 #1)
EIP is at early_kmem_cache_node_alloc+0x2b/0x8d
eax: 00000000 ebx: 00000001 ecx: d38014e4 edx: c12c3a60
esi: 00000000 edi: 00000001 ebp: 000000d0 esp: c1343f3c
ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0068
Process swapper (pid: 0, ti=c1342000 task=c12c3a60 task.ti=c1342000)
Stack: 00000001 c133c3e0 00000000 c105d20c 00000000 c133c3e0 00000000 000000d0
c105d5e7 00000003 c1343fa4 000000d0 00010e56 c1343fa4 c1276826 00000003
00000055 c127b3c1 00000000 000000d0 c133c3e0 0000001c c105d86e 0000001c
Call Trace:
[<c105d20c>] init_kmem_cache_nodes+0x8f/0xdd
[<c105d5e7>] kmem_cache_open+0x86/0xdd
[<c105d86e>] create_kmalloc_cache+0x51/0xa7
[<c135a483>] kmem_cache_init+0x50/0x16e
[<c101b72a>] printk+0x16/0x19
[<c1355855>] test_wp_bit+0x7e/0x81
[<c1348966>] start_kernel+0x19f/0x21c
[<c13483d8>] unknown_bootoption+0x0/0x139
=======================
Code: 83 3d e4 c3 33 c1 1b 57 89 d7 56 53 77 04 0f 0b eb fe 0d 00 12 04 00 89 d1 89 c2 b8 e0 c3 33 c1 e8 99 f5 ff ff 85 c0 89 c6 75 04 <0f> 0b eb fe 8b 58 14 85 db 75 04 0f 0b eb fe a1 ec c3 33 c1 b9
EIP: [<c105d0a8>] early_kmem_cache_node_alloc+0x2b/0x8d SS:ESP 0068:c1343f3c
Kernel panic - not syncing: Attempted to kill the idle task!
Then, on a !NUMA ppc64 box, I got:
alloc_bootmem_core(): zero-sized request
------------[ cut here ]------------
kernel BUG at mm/bootmem.c:190!
cpu 0x0: Vector: 700 (Program Check) at [c000000000833910]
pc: c0000000006b4644: .__alloc_bootmem_core+0x58/0x410
lr: c0000000006b4640: .__alloc_bootmem_core+0x54/0x410
sp: c000000000833b90
msr: 8000000000029032
current = 0xc0000000007276a0
paca = 0xc000000000728000
pid = 0, comm = swapper
kernel BUG at mm/bootmem.c:190!
enter ? for help
[c000000000833c50] c0000000006b4b14 .__alloc_bootmem_nopanic+0x40/0xac
[c000000000833cf0] c0000000006b4ba0 .__alloc_bootmem+0x20/0x5c
[c000000000833d70] c0000000006b56e0 .alloc_large_system_hash+0x120/0x2bc
[c000000000833e50] c0000000006b6b14 .vfs_caches_init_early+0x54/0xb4
[c000000000833ee0] c000000000694cc4 .start_kernel+0x2e8/0x3f4
[c000000000833f90] c000000000008534 .start_here_common+0x60/0x12c
I'm going to verify if the latter, at least, happens with plain
2.6.23-rc1-mm1, but wanted to get these reports out there.
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* [PATCH/RFC] 2.6.23-rc1-mm1: MPOL_PREFERRED fixups for preferred_node < 0
2007-07-27 19:43 ` [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes Lee Schermerhorn
@ 2007-07-30 21:38 ` Lee Schermerhorn
2007-07-30 22:00 ` Lee Schermerhorn
2007-07-31 21:05 ` [PATCH/RFC] 2.6.23-rc1-mm1: MPOL_PREFERRED fixups for preferred_node < 0 - v2 Lee Schermerhorn
2007-08-01 2:22 ` [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes Andrew Morton
1 sibling, 2 replies; 68+ messages in thread
From: Lee Schermerhorn @ 2007-07-30 21:38 UTC (permalink / raw)
To: linux-mm
Cc: ak, Nishanth Aravamudan, pj, kxr, Christoph Lameter, Mel Gorman,
akpm, KAMEZAWA Hiroyuki
These are some "issues" that I came across working on the Memoryless
Node series. I'm using the same cc: list as that series, since the issues
are somewhat related.
Only boot tested at this point.
Comments?
Lee
---------------------------
PATCH/RFC - MPOL_PREFERRED fixups for "local allocation"
Here are a couple of potential "fixups" for MPOL_PREFERRED behavior
when v.preferred_node < 0 -- i.e., "local allocation":
1) [do_]get_mempolicy() calls the misnamed get_zonemask() to fetch the
nodemask associated with a policy. Currently, get_zonemask() returns
the set of nodes with memory, when the policy 'mode' is 'PREFERRED',
and the preferred_node is < 0. Return the set of allowed nodes
instead. This will already have been masked to include only nodes
with memory.
2) When a task is moved into a [new] cpuset, mpol_rebind_policy() is
called to adjust any task and vma policy nodes to be valid in the
new cpuset. However, when the policy is MPOL_PREFERRED, and the
preferred_node is <0, no rebind is necessary. The "local allocation"
indication is valid in any cpuset.
3) mpol_to_str() produces a printable, "human readable" string from a
struct mempolicy. For MPOL_PREFERRED with preferred_node <0, show
the entire set of valid nodes. Although, technically, MPOL_PREFERRED
takes only a single node, preferred_node <0 is a local allocation policy,
with the preferred node determined by the context where the task
is executing. All of the allowed nodes are possible, as the task
migrates among the nodes in the cpuset. Indeed, the task/vma may have
memory from any of the allowed nodes.
Comments?
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
mm/mempolicy.c | 28 ++++++++++++++++++++++------
1 file changed, 22 insertions(+), 6 deletions(-)
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2007-07-30 16:18:27.000000000 -0400
+++ Linux/mm/mempolicy.c 2007-07-30 16:37:19.000000000 -0400
@@ -494,9 +494,11 @@ static void get_zonemask(struct mempolic
*nodes = p->v.nodes;
break;
case MPOL_PREFERRED:
- /* or use current node instead of memory_map? */
+ /*
+ * for "local policy", return allowed memories
+ */
if (p->v.preferred_node < 0)
- *nodes = node_states[N_MEMORY];
+ *nodes = cpuset_current_mems_allowed;
else
node_set(p->v.preferred_node, *nodes);
break;
@@ -1650,6 +1652,7 @@ void mpol_rebind_policy(struct mempolicy
{
nodemask_t *mpolmask;
nodemask_t tmp;
+ int nid;
if (!pol)
return;
@@ -1668,9 +1671,15 @@ void mpol_rebind_policy(struct mempolicy
*mpolmask, *newmask);
break;
case MPOL_PREFERRED:
- pol->v.preferred_node = node_remap(pol->v.preferred_node,
+ /*
+ * no need to remap "local policy"
+ */
+ nid = pol->v.preferred_node;
+ if (nid >= 0) {
+ pol->v.preferred_node = node_remap(nid,
*mpolmask, *newmask);
- *mpolmask = *newmask;
+ *mpolmask = *newmask;
+ }
break;
case MPOL_BIND: {
nodemask_t nodes;
@@ -1745,7 +1754,7 @@ static const char * const policy_types[]
static inline int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
{
char *p = buffer;
- int l;
+ int nid, l;
nodemask_t nodes;
int mode = pol ? pol->policy : MPOL_DEFAULT;
@@ -1756,7 +1765,14 @@ static inline int mpol_to_str(char *buff
case MPOL_PREFERRED:
nodes_clear(nodes);
- node_set(pol->v.preferred_node, nodes);
+ nid = pol->v.preferred_node;
+ /*
+ * local interleave, show all valid nodes
+ */
+ if (nid < 0)
+ nodes_or(nodes, cpuset_current_mems_allowed);
+ else
+ node_set(nid, nodes);
break;
case MPOL_BIND:
* Re: [PATCH/RFC] 2.6.23-rc1-mm1: MPOL_PREFERRED fixups for preferred_node < 0
2007-07-30 21:38 ` [PATCH/RFC] 2.6.23-rc1-mm1: MPOL_PREFERRED fixups for preferred_node < 0 Lee Schermerhorn
@ 2007-07-30 22:00 ` Lee Schermerhorn
2007-07-31 15:32 ` Mel Gorman
2007-07-31 21:05 ` [PATCH/RFC] 2.6.23-rc1-mm1: MPOL_PREFERRED fixups for preferred_node < 0 - v2 Lee Schermerhorn
1 sibling, 1 reply; 68+ messages in thread
From: Lee Schermerhorn @ 2007-07-30 22:00 UTC (permalink / raw)
To: linux-mm
Cc: ak, Nishanth Aravamudan, pj, kxr, Christoph Lameter, Mel Gorman,
akpm, KAMEZAWA Hiroyuki
On Mon, 2007-07-30 at 17:38 -0400, Lee Schermerhorn wrote:
> These are some "issues" that I came across working on the Memoryless
> Node series. I'm using the same cc: list as that series as the issues
> are somewhat related.
>
> Only boot tested at this point.
I sent the wrong patch--forgot to refresh before posting :-(. Bogus
code in mpol_to_str() in previous patch.
Try this one.
Lee
> ---------------------------
PATCH/RFC - MPOL_PREFERRED fixups for "local allocation"
Here are a couple of potential "fixups" for MPOL_PREFERRED behavior
when v.preferred_node < 0 -- i.e., "local allocation":
1) [do_]get_mempolicy() calls the misnamed get_zonemask() to fetch the
nodemask associated with a policy. Currently, get_zonemask() returns
the set of nodes with memory, when the policy 'mode' is 'PREFERRED',
and the preferred_node is < 0. Return the set of allowed nodes
instead. This will already have been masked to include only nodes
with memory.
2) When a task is moved into a [new] cpuset, mpol_rebind_policy() is
called to adjust any task and vma policy nodes to be valid in the
new cpuset. However, when the policy is MPOL_PREFERRED, and the
preferred_node is <0, no rebind is necessary. The "local allocation"
indication is valid in any cpuset.
3) mpol_to_str() produces a printable, "human readable" string from a
struct mempolicy. For MPOL_PREFERRED with preferred_node <0, show
the entire set of valid nodes. Although, technically, MPOL_PREFERRED
takes only a single node, preferred_node <0 is a local allocation policy,
with the preferred node determined by the context where the task
is executing. All of the allowed nodes are possible, as the task
migrates among the nodes in the cpuset.
Comments?
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
mm/mempolicy.c | 31 ++++++++++++++++++++++++-------
1 file changed, 24 insertions(+), 7 deletions(-)
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2007-07-30 17:32:06.000000000 -0400
+++ Linux/mm/mempolicy.c 2007-07-30 17:38:17.000000000 -0400
@@ -494,9 +494,11 @@ static void get_zonemask(struct mempolic
*nodes = p->v.nodes;
break;
case MPOL_PREFERRED:
- /* or use current node instead of memory_map? */
+ /*
+ * for "local policy", return allowed memories
+ */
if (p->v.preferred_node < 0)
- *nodes = node_states[N_MEMORY];
+ *nodes = cpuset_current_mems_allowed;
else
node_set(p->v.preferred_node, *nodes);
break;
@@ -1650,6 +1652,7 @@ void mpol_rebind_policy(struct mempolicy
{
nodemask_t *mpolmask;
nodemask_t tmp;
+ int nid;
if (!pol)
return;
@@ -1668,9 +1671,15 @@ void mpol_rebind_policy(struct mempolicy
*mpolmask, *newmask);
break;
case MPOL_PREFERRED:
- pol->v.preferred_node = node_remap(pol->v.preferred_node,
+ /*
+ * no need to remap "local policy"
+ */
+ nid = pol->v.preferred_node;
+ if (nid >= 0) {
+ pol->v.preferred_node = node_remap(nid,
*mpolmask, *newmask);
- *mpolmask = *newmask;
+ *mpolmask = *newmask;
+ }
break;
case MPOL_BIND: {
nodemask_t nodes;
@@ -1745,7 +1754,7 @@ static const char * const policy_types[]
static inline int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
{
char *p = buffer;
- int l;
+ int nid, l;
nodemask_t nodes;
int mode = pol ? pol->policy : MPOL_DEFAULT;
@@ -1755,8 +1764,16 @@ static inline int mpol_to_str(char *buff
break;
case MPOL_PREFERRED:
- nodes_clear(nodes);
- node_set(pol->v.preferred_node, nodes);
+ nid = pol->v.preferred_node;
+ /*
+ * local interleave, show all valid nodes
+ */
+ if (nid < 0)
+ nodes = cpuset_current_mems_allowed;
+ else {
+ nodes_clear(nodes);
+ node_set(nid, nodes);
+ }
break;
case MPOL_BIND:
* Re: [PATCH 00/14] NUMA: Memoryless node support V4
2007-07-30 21:19 ` Nishanth Aravamudan
@ 2007-07-30 22:06 ` Christoph Lameter
2007-07-30 22:35 ` Andi Kleen
0 siblings, 1 reply; 68+ messages in thread
From: Christoph Lameter @ 2007-07-30 22:06 UTC (permalink / raw)
To: Nishanth Aravamudan
Cc: Lee Schermerhorn, linux-mm, ak, pj, kxr, Mel Gorman, akpm,
KAMEZAWA Hiroyuki, apw
On Mon, 30 Jul 2007, Nishanth Aravamudan wrote:
> On moe, a NUMA-Q box (part of test.kernel.org), I didn't see the same
> panic that Andy reported, instead I got:
>
> ------------[ cut here ]------------
> kernel BUG at mm/slub.c:1895!
> invalid opcode: 0000 [#1]
> SMP
> Modules linked in:
> CPU: 0
> EIP: 0060:[<c105d0a8>] Not tainted VLI
> EFLAGS: 00010046 (2.6.23-rc1-mm1-autokern1 #1)
> EIP is at early_kmem_cache_node_alloc+0x2b/0x8d
> eax: 00000000 ebx: 00000001 ecx: d38014e4 edx: c12c3a60
> esi: 00000000 edi: 00000001 ebp: 000000d0 esp: c1343f3c
> ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0068
> Process swapper (pid: 0, ti=c1342000 task=c12c3a60 task.ti=c1342000)
> Stack: 00000001 c133c3e0 00000000 c105d20c 00000000 c133c3e0 00000000 000000d0
> c105d5e7 00000003 c1343fa4 000000d0 00010e56 c1343fa4 c1276826 00000003
> 00000055 c127b3c1 00000000 000000d0 c133c3e0 0000001c c105d86e 0000001c
> Call Trace:
> [<c105d20c>] init_kmem_cache_nodes+0x8f/0xdd
> [<c105d5e7>] kmem_cache_open+0x86/0xdd
> [<c105d86e>] create_kmalloc_cache+0x51/0xa7
> [<c135a483>] kmem_cache_init+0x50/0x16e
> [<c101b72a>] printk+0x16/0x19
> [<c1355855>] test_wp_bit+0x7e/0x81
> [<c1348966>] start_kernel+0x19f/0x21c
> [<c13483d8>] unknown_bootoption+0x0/0x139
> =======================
> Code: 83 3d e4 c3 33 c1 1b 57 89 d7 56 53 77 04 0f 0b eb fe 0d 00 12 04 00 89 d1 89 c2 b8 e0 c3 33 c1 e8 99 f5 ff ff 85 c0 89 c6 75 04 <0f> 0b eb fe 8b 58 14 85 db 75 04 0f 0b eb fe a1 ec c3 33 c1 b9
> EIP: [<c105d0a8>] early_kmem_cache_node_alloc+0x2b/0x8d SS:ESP 0068:c1343f3c
> Kernel panic - not syncing: Attempted to kill the idle task!
Hmmm... yes trouble with NUMAQ is that the nodes only have HIGHMEM
but no NORMAL memory. The memory is not available to the slab allocator
(needs ZONE_NORMAL memory) and we cannot fall back anymore. We may need
something like N_SLAB that defines the allowed nodes for the slab
allocators. Sigh.
* Re: [PATCH 00/14] NUMA: Memoryless node support V4
2007-07-30 22:06 ` Christoph Lameter
@ 2007-07-30 22:35 ` Andi Kleen
2007-07-30 22:36 ` Christoph Lameter
0 siblings, 1 reply; 68+ messages in thread
From: Andi Kleen @ 2007-07-30 22:35 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nishanth Aravamudan, Lee Schermerhorn, linux-mm, pj, kxr,
Mel Gorman, akpm, KAMEZAWA Hiroyuki, apw
> Hmmm... yes trouble with NUMAQ is that the nodes only have HIGHMEM
> but no NORMAL memory. The memory is not available to the slab allocator
> (needs ZONE_NORMAL memory) and we cannot fall back anymore. We may need
> something like N_SLAB that defines the allowed nodes for the slab
> allocators. Sigh.
Or just disable 32bit NUMA. The arch/i386 numa code is beyond ugly anyways
and I don't think it ever worked particularly well.
-Andi
* Re: [PATCH 00/14] NUMA: Memoryless node support V4
2007-07-30 22:35 ` Andi Kleen
@ 2007-07-30 22:36 ` Christoph Lameter
2007-07-31 23:18 ` Nishanth Aravamudan
0 siblings, 1 reply; 68+ messages in thread
From: Christoph Lameter @ 2007-07-30 22:36 UTC (permalink / raw)
To: Andi Kleen
Cc: Nishanth Aravamudan, Lee Schermerhorn, linux-mm, pj, kxr,
Mel Gorman, akpm, KAMEZAWA Hiroyuki, apw
On Tue, 31 Jul 2007, Andi Kleen wrote:
>
> > Hmmm... yes trouble with NUMAQ is that the nodes only have HIGHMEM
> > but no NORMAL memory. The memory is not available to the slab allocator
> > (needs ZONE_NORMAL memory) and we cannot fall back anymore. We may need
> > something like N_SLAB that defines the allowed nodes for the slab
> > allocators. Sigh.
>
> Or just disable 32bit NUMA. The arch/i386 numa code is beyond ugly anyways
> and I don't think it ever worked particularly well.
So we would no longer support NUMAQ? Is that possible?
* Re: [PATCH/RFC] 2.6.23-rc1-mm1: MPOL_PREFERRED fixups for preferred_node < 0
2007-07-30 22:00 ` Lee Schermerhorn
@ 2007-07-31 15:32 ` Mel Gorman
2007-07-31 15:58 ` Lee Schermerhorn
0 siblings, 1 reply; 68+ messages in thread
From: Mel Gorman @ 2007-07-31 15:32 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, ak, Nishanth Aravamudan, pj, kxr, Christoph Lameter,
akpm, KAMEZAWA Hiroyuki
On (30/07/07 18:00), Lee Schermerhorn didst pronounce:
> On Mon, 2007-07-30 at 17:38 -0400, Lee Schermerhorn wrote:
> > These are some "issues" that I came across working on the Memoryless
> > Node series. I'm using the same cc: list as that series as the issues
> > are somewhat related.
> >
> > Only boot tested at this point.
>
> I sent the wrong patch--forgot to refresh before posting :-(. Bogus
> code in mpol_to_str() in previous patch.
>
> Try this one.
>
> Lee
>
> > ---------------------------
>
> PATCH/RFC - MPOL_PREFERRED fixups for "local allocation"
>
> Here are a couple of potential "fixups" for MPOL_PREFERRED behavior
> when v.preferred_node < 0 -- i.e., "local allocation":
>
> 1) [do_]get_mempolicy() calls the misnamed get_zonemask() to fetch the
> nodemask associated with a policy. Currently, get_zonemask() returns
> the set of nodes with memory, when the policy 'mode' is 'PREFERRED',
Consider a cleanup that renames get_zonemask because the naming is
misleading.
> and the preferred_node is < 0. Return the set of allowed nodes
> instead. This will already have been masked to include only nodes
> with memory.
>
> 2) When a task is moved into a [new] cpuset, mpol_rebind_policy() is
> called to adjust any task and vma policy nodes to be valid in the
> new cpuset. However, when the policy is MPOL_PREFERRED, and the
> preferred_node is <0, no rebind is necessary. The "local allocation"
> indication is valid in any cpuset.
>
> 3) mpol_to_str() produces a printable, "human readable" string from a
> struct mempolicy. For MPOL_PREFERRED with preferred_node <0, show
> the entire set of valid nodes. Although, technically, MPOL_PREFERRED
> takes only a single node, preferred_node <0 is a local allocation policy,
> with the preferred node determined by the context where the task
> is executing. All of the allowed nodes are possible, as the task
> migrates among the nodes in the cpuset.
>
> Comments?
>
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
>
> mm/mempolicy.c | 31 ++++++++++++++++++++++++-------
> 1 file changed, 24 insertions(+), 7 deletions(-)
>
> Index: Linux/mm/mempolicy.c
> ===================================================================
> --- Linux.orig/mm/mempolicy.c 2007-07-30 17:32:06.000000000 -0400
> +++ Linux/mm/mempolicy.c 2007-07-30 17:38:17.000000000 -0400
> @@ -494,9 +494,11 @@ static void get_zonemask(struct mempolic
> *nodes = p->v.nodes;
> break;
> case MPOL_PREFERRED:
> - /* or use current node instead of memory_map? */
> + /*
> + * for "local policy", return allowed memories
> + */
> if (p->v.preferred_node < 0)
> - *nodes = node_states[N_MEMORY];
> + *nodes = cpuset_current_mems_allowed;
> else
Is this actually a bugfix? From this context, it looks like memory
policies using MPOL_PREFERRED can ignore cpusets.
> node_set(p->v.preferred_node, *nodes);
> break;
> @@ -1650,6 +1652,7 @@ void mpol_rebind_policy(struct mempolicy
> {
> nodemask_t *mpolmask;
> nodemask_t tmp;
> + int nid;
>
> if (!pol)
> return;
> @@ -1668,9 +1671,15 @@ void mpol_rebind_policy(struct mempolicy
> *mpolmask, *newmask);
> break;
> case MPOL_PREFERRED:
> - pol->v.preferred_node = node_remap(pol->v.preferred_node,
> + /*
> + * no need to remap "local policy"
> + */
> + nid = pol->v.preferred_node;
> + if (nid >= 0) {
> + pol->v.preferred_node = node_remap(nid,
> *mpolmask, *newmask);
> - *mpolmask = *newmask;
> + *mpolmask = *newmask;
> + }
> break;
> case MPOL_BIND: {
> nodemask_t nodes;
> @@ -1745,7 +1754,7 @@ static const char * const policy_types[]
> static inline int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
> {
> char *p = buffer;
> - int l;
> + int nid, l;
> nodemask_t nodes;
> int mode = pol ? pol->policy : MPOL_DEFAULT;
>
> @@ -1755,8 +1764,16 @@ static inline int mpol_to_str(char *buff
> break;
>
> case MPOL_PREFERRED:
> - nodes_clear(nodes);
> - node_set(pol->v.preferred_node, nodes);
> + nid = pol->v.preferred_node;
> + /*
> + * local interleave, show all valid nodes
> + */
> + if (nid < 0 )
> + nodes = cpuset_current_mems_allowed;
> + else {
> + nodes_clear(nodes);
> + node_set(nid, nodes);
> + }
> break;
>
> case MPOL_BIND:
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH/RFC] 2.6.23-rc1-mm1: MPOL_PREFERRED fixups for preferred_node < 0
2007-07-31 15:32 ` Mel Gorman
@ 2007-07-31 15:58 ` Lee Schermerhorn
0 siblings, 0 replies; 68+ messages in thread
From: Lee Schermerhorn @ 2007-07-31 15:58 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, ak, Nishanth Aravamudan, pj, kxr, Christoph Lameter,
akpm, KAMEZAWA Hiroyuki
On Tue, 2007-07-31 at 16:32 +0100, Mel Gorman wrote:
> On (30/07/07 18:00), Lee Schermerhorn didst pronounce:
> > On Mon, 2007-07-30 at 17:38 -0400, Lee Schermerhorn wrote:
> > > These are some "issues" that I came across working on the Memoryless
> > > Node series. I'm using the same cc: list as that series as the issues
> > > are somewhat related.
> > >
> > > Only boot tested at this point.
> >
> > I sent the wrong patch--forgot to refresh before posting :-(. Bogus
> > code in mpol_to_str() in previous patch.
> >
> > Try this one.
> >
> > Lee
> >
> > > ---------------------------
> >
> > PATCH/RFC - MPOL_PREFERRED fixups for "local allocation"
> >
> > Here are a couple of potential "fixups" for MPOL_PREFERRED behavior
> > when v.preferred_node < 0 -- i.e., "local allocation":
> >
> > 1) [do_]get_mempolicy() calls the misnamed get_zonemask() to fetch the
> > nodemask associated with a policy. Currently, get_zonemask() returns
> > the set of nodes with memory, when the policy 'mode' is 'PREFERRED',
>
> Consider a cleanup that renames get_zonemask because the naming is
> misleading.
I can do that. Wanted to hear from others, such as yourself first.
>
> > and the preferred_node is < 0. Return the set of allowed nodes
> > instead. This will already have been masked to include only nodes
> > with memory.
> >
> > 2) When a task is moved into a [new] cpuset, mpol_rebind_policy() is
> > called to adjust any task and vma policy nodes to be valid in the
> > new cpuset. However, when the policy is MPOL_PREFERRED, and the
> > preferred_node is <0, no rebind is necessary. The "local allocation"
> > indication is valid in any cpuset.
> >
> > 3) mpol_to_str() produces a printable, "human readable" string from a
> > struct mempolicy. For MPOL_PREFERRED with preferred_node <0, show
> > the entire set of valid nodes. Although, technically, MPOL_PREFERRED
> > takes only a single node, preferred_node <0 is a local allocation policy,
> > with the preferred node determined by the context where the task
> > is executing. All of the allowed nodes are possible, as the task
> > migrates among the nodes in the cpuset.
> >
> > Comments?
> >
> > Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> >
> > mm/mempolicy.c | 31 ++++++++++++++++++++++++-------
> > 1 file changed, 24 insertions(+), 7 deletions(-)
> >
> > Index: Linux/mm/mempolicy.c
> > ===================================================================
> > --- Linux.orig/mm/mempolicy.c 2007-07-30 17:32:06.000000000 -0400
> > +++ Linux/mm/mempolicy.c 2007-07-30 17:38:17.000000000 -0400
> > @@ -494,9 +494,11 @@ static void get_zonemask(struct mempolic
> > *nodes = p->v.nodes;
> > break;
> > case MPOL_PREFERRED:
> > - /* or use current node instead of memory_map? */
> > + /*
> > + * for "local policy", return allowed memories
> > + */
> > if (p->v.preferred_node < 0)
> > - *nodes = node_states[N_MEMORY];
> > + *nodes = cpuset_current_mems_allowed;
> > else
>
> Is this actually a bugfix? From this context, it looks like memory
> policies using MPOL_PREFERRED can ignore cpusets.
Not a serious bug, if it is one. More of a cleanup. All this does is
return a node mask in the case where the application has a task memory
policy of 'PREFERRED' with a node id of -1 [which happens when you
specify an empty nodemask to set_mempolicy() or mbind()]. This means
"local allocation"--the actual "current node id" is fetched at
allocation time. This is a little-known "feature" of get_mempolicy().
The result is misleading, but there isn't much the application can do
with it. Node masks are ANDed with cpuset_current_mems_allowed when
installed via a syscall.
>
> > node_set(p->v.preferred_node, *nodes);
> > break;
> > @@ -1650,6 +1652,7 @@ void mpol_rebind_policy(struct mempolicy
> > {
> > nodemask_t *mpolmask;
> > nodemask_t tmp;
> > + int nid;
> >
> > if (!pol)
> > return;
> > @@ -1668,9 +1671,15 @@ void mpol_rebind_policy(struct mempolicy
> > *mpolmask, *newmask);
> > break;
> > case MPOL_PREFERRED:
> > - pol->v.preferred_node = node_remap(pol->v.preferred_node,
Ultimately, node_remap() [the bitmap functions it calls] will return the
old value of "-1" because it's outside the valid range for the node
bitmasks. However, it doesn't seem right to be calling node_remap()
with an invalid node id. I think it's clearer this way:
> > + /*
> > + * no need to remap "local policy"
> > + */
> > + nid = pol->v.preferred_node;
> > + if (nid >= 0) {
> > + pol->v.preferred_node = node_remap(nid,
> > *mpolmask, *newmask);
> > - *mpolmask = *newmask;
> > + *mpolmask = *newmask;
> > + }
> > break;
> > case MPOL_BIND: {
> > nodemask_t nodes;
> > @@ -1745,7 +1754,7 @@ static const char * const policy_types[]
> > static inline int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
> > {
> > char *p = buffer;
> > - int l;
> > + int nid, l;
> > nodemask_t nodes;
> > int mode = pol ? pol->policy : MPOL_DEFAULT;
> >
> > @@ -1755,8 +1764,16 @@ static inline int mpol_to_str(char *buff
> > break;
> >
> > case MPOL_PREFERRED:
> > - nodes_clear(nodes);
> > - node_set(pol->v.preferred_node, nodes);
Here, I think set_bit() will set bit 31. Again, misleading, IMO.
> > + nid = pol->v.preferred_node;
> > + /*
> > + * local allocation, show all valid nodes
> > + */
> > + if (nid < 0 )
> > + nodes = cpuset_current_mems_allowed;
> > + else {
> > + nodes_clear(nodes);
> > + node_set(nid, nodes);
> > + }
> > break;
> >
> > case MPOL_BIND:
> >
>
> --
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* [PATCH/RFC] 2.6.23-rc1-mm1: MPOL_PREFERRED fixups for preferred_node < 0 - v2
2007-07-30 21:38 ` [PATCH/RFC] 2.6.23-rc1-mm1: MPOL_PREFERRED fixups for preferred_node < 0 Lee Schermerhorn
2007-07-30 22:00 ` Lee Schermerhorn
@ 2007-07-31 21:05 ` Lee Schermerhorn
1 sibling, 0 replies; 68+ messages in thread
From: Lee Schermerhorn @ 2007-07-31 21:05 UTC (permalink / raw)
To: linux-mm
Cc: ak, Nishanth Aravamudan, pj, kxr, Christoph Lameter, Mel Gorman,
akpm, KAMEZAWA Hiroyuki, Eric Whitney
How about this?
---------------------------
PATCH/RFC - MPOL_PREFERRED fixups for "local allocation" - V2
Against: 2.6.23-rc1-mm1 atop Memoryless Node patch series
V1 -> V2:
+ renamed get_zonemask() to get_nodemask(). Mel Gorman suggested this
was a "cleanup" I should go ahead and add.
Here are a couple of potential "fixups" for MPOL_PREFERRED behavior
when v.preferred_node < 0 -- i.e., "local allocation":
1) [do_]get_mempolicy() calls the now renamed get_nodemask() to fetch the
nodemask associated with a policy. Currently, get_nodemask() returns
the set of nodes with memory when the policy mode is MPOL_PREFERRED
and the preferred_node is < 0. Return the set of allowed nodes
instead. This will already have been masked to include only nodes
with memory.
2) When a task is moved into a [new] cpuset, mpol_rebind_policy() is
called to adjust any task and vma policy nodes to be valid in the
new cpuset. However, when the policy is MPOL_PREFERRED, and the
preferred_node is < 0, no rebind is necessary. The "local allocation"
indication is valid in any cpuset. Existing code will "do the right
thing" because node_remap() will just return the argument node when
it is outside of the valid range of node ids. However, I think it is
clearer and cleaner to skip the remap explicitly in this case.
3) mpol_to_str() produces a printable, "human readable" string from a
struct mempolicy. For MPOL_PREFERRED with preferred_node < 0, show
the entire set of valid nodes. Although, technically, MPOL_PREFERRED
takes only a single node, preferred_node < 0 is a local allocation policy,
with the preferred node determined by the context where the task
is executing. All of the allowed nodes are possible, as the task
migrates among the nodes in the cpuset. Without this change, I believe
that node_set() [via set_bit()] will set bit 31, resulting in a misleading
display.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
mm/mempolicy.c | 31 ++++++++++++++++++++++++-------
1 file changed, 24 insertions(+), 7 deletions(-)
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2007-07-30 17:32:06.000000000 -0400
+++ Linux/mm/mempolicy.c 2007-07-30 17:38:17.000000000 -0400
@@ -494,9 +494,11 @@ static void get_zonemask(struct mempolic
*nodes = p->v.nodes;
break;
case MPOL_PREFERRED:
- /* or use current node instead of memory_map? */
+ /*
+ * for "local policy", return allowed memories
+ */
if (p->v.preferred_node < 0)
- *nodes = node_states[N_MEMORY];
+ *nodes = cpuset_current_mems_allowed;
else
node_set(p->v.preferred_node, *nodes);
break;
@@ -1650,6 +1652,7 @@ void mpol_rebind_policy(struct mempolicy
{
nodemask_t *mpolmask;
nodemask_t tmp;
+ int nid;
if (!pol)
return;
@@ -1668,9 +1671,15 @@ void mpol_rebind_policy(struct mempolicy
*mpolmask, *newmask);
break;
case MPOL_PREFERRED:
- pol->v.preferred_node = node_remap(pol->v.preferred_node,
+ /*
+ * no need to remap "local policy"
+ */
+ nid = pol->v.preferred_node;
+ if (nid >= 0) {
+ pol->v.preferred_node = node_remap(nid,
*mpolmask, *newmask);
- *mpolmask = *newmask;
+ *mpolmask = *newmask;
+ }
break;
case MPOL_BIND: {
nodemask_t nodes;
@@ -1745,7 +1754,7 @@ static const char * const policy_types[]
static inline int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
{
char *p = buffer;
- int l;
+ int nid, l;
nodemask_t nodes;
int mode = pol ? pol->policy : MPOL_DEFAULT;
@@ -1755,8 +1764,16 @@ static inline int mpol_to_str(char *buff
break;
case MPOL_PREFERRED:
- nodes_clear(nodes);
- node_set(pol->v.preferred_node, nodes);
+ nid = pol->v.preferred_node;
+ /*
+ * local allocation, show all valid nodes
+ */
+ if (nid < 0 )
+ nodes = cpuset_current_mems_allowed;
+ else {
+ nodes_clear(nodes);
+ node_set(nid, nodes);
+ }
break;
case MPOL_BIND:
* Re: [PATCH 00/14] NUMA: Memoryless node support V4
2007-07-30 22:36 ` Christoph Lameter
@ 2007-07-31 23:18 ` Nishanth Aravamudan
0 siblings, 0 replies; 68+ messages in thread
From: Nishanth Aravamudan @ 2007-07-31 23:18 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andi Kleen, Lee Schermerhorn, linux-mm, pj, kxr, Mel Gorman,
akpm, KAMEZAWA Hiroyuki, apw
On 30.07.2007 [15:36:19 -0700], Christoph Lameter wrote:
> On Tue, 31 Jul 2007, Andi Kleen wrote:
>
> >
> > > Hmmm... yes trouble with NUMAQ is that the nodes only have HIGHMEM
> > > but no NORMAL memory. The memory is not available to the slab
> > > allocator (needs ZONE_NORMAL memory) and we cannot fall back
> > > anymore. We may need something like N_SLAB that defines the
> > > allowed nodes for the slab allocators. Sigh.
> >
> > Or just disable 32bit NUMA. The arch/i386 numa code is beyond ugly
> > anyways and I don't think it ever worked particularly well.
>
> So we would no longer support NUMAQ? Is that possible?
Seems a bit excessive in the context of these patches. The kernel worked
before this stack and doesn't with it. Then again, I guess NUMAQ only
worked because it relied on a fallback that shouldn't have happened?
I'm not sure what the best solution is -- maybe Andy has some insight?
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes
2007-07-27 19:43 ` [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes Lee Schermerhorn
2007-07-30 21:38 ` [PATCH/RFC] 2.6.23-rc1-mm1: MPOL_PREFERRED fixups for preferred_node < 0 Lee Schermerhorn
@ 2007-08-01 2:22 ` Andrew Morton
2007-08-01 2:52 ` Christoph Lameter
1 sibling, 1 reply; 68+ messages in thread
From: Andrew Morton @ 2007-08-01 2:22 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, ak, Nishanth Aravamudan, pj, kxr, Christoph Lameter,
Mel Gorman, KAMEZAWA Hiroyuki
On Fri, 27 Jul 2007 15:43:22 -0400 Lee Schermerhorn <lee.schermerhorn@hp.com> wrote:
> [patch 1/14] NUMA: Generic management of nodemasks for various purposes
>
> Preparation for memoryless node patches.
>
> Provide a generic way to keep nodemasks describing various characteristics
> of NUMA nodes.
>
> Remove the node_online_map and the node_possible map and realize the whole
> thing using two nodes stats: N_POSSIBLE and N_ONLINE.
>
> ...
>
> +#define for_each_node_state(node, __state) \
> + for ( (node) = 0; (node) != 0; (node) = 1)
That looks weird.
This patch causes early crashes on i386.
http://userweb.kernel.org/~akpm/dsc03671.jpg
http://userweb.kernel.org/~akpm/config-vmm.txt
* Re: [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes
2007-08-01 2:22 ` [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes Andrew Morton
@ 2007-08-01 2:52 ` Christoph Lameter
2007-08-01 3:05 ` Andrew Morton
0 siblings, 1 reply; 68+ messages in thread
From: Christoph Lameter @ 2007-08-01 2:52 UTC (permalink / raw)
To: Andrew Morton
Cc: Lee Schermerhorn, linux-mm, ak, Nishanth Aravamudan, pj, kxr,
Mel Gorman, KAMEZAWA Hiroyuki
On Tue, 31 Jul 2007, Andrew Morton wrote:
> >
> > +#define for_each_node_state(node, __state) \
> > + for ( (node) = 0; (node) != 0; (node) = 1)
>
> That looks weird.
Yup and we have committed the usual sin of not testing !NUMA.
The loop needs to execute for node = 0 but leave node = 1 on exit. We
want to avoid increments so that the compiler can optimize better.
As it stands, the loop body is not executed at all and we have node = 0
when the loop is done.
---
include/linux/nodemask.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux-2.6/include/linux/nodemask.h
===================================================================
--- linux-2.6.orig/include/linux/nodemask.h 2007-07-31 19:46:00.000000000 -0700
+++ linux-2.6/include/linux/nodemask.h 2007-07-31 19:46:29.000000000 -0700
@@ -404,7 +404,7 @@ static inline int num_node_state(enum no
}
#define for_each_node_state(node, __state) \
- for ( (node) = 0; (node) != 0; (node) = 1)
+ for ( (node) = 0; (node) == 0; (node) = 1)
#define first_online_node 0
#define next_online_node(nid) (MAX_NUMNODES)
* Re: [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes
2007-08-01 2:52 ` Christoph Lameter
@ 2007-08-01 3:05 ` Andrew Morton
2007-08-01 3:14 ` Christoph Lameter
2007-08-01 15:25 ` Nishanth Aravamudan
0 siblings, 2 replies; 68+ messages in thread
From: Andrew Morton @ 2007-08-01 3:05 UTC (permalink / raw)
To: Christoph Lameter
Cc: Lee Schermerhorn, linux-mm, ak, Nishanth Aravamudan, pj, kxr,
Mel Gorman, KAMEZAWA Hiroyuki
On Tue, 31 Jul 2007 19:52:23 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
> On Tue, 31 Jul 2007, Andrew Morton wrote:
>
> > >
> > > +#define for_each_node_state(node, __state) \
> > > + for ( (node) = 0; (node) != 0; (node) = 1)
> >
> > That looks weird.
>
> Yup and we have committed the usual sin of not testing !NUMA.
ooookay... I don't think I want to be the first person who gets
to do that, so I shall duck them for -mm2.
I think there were updates pending anyway. I saw several under-replied-to
patches from Lee but it wasn't clear if they were relevant to these changes
or what.
I'll let things cook a bit more.
* Re: [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes
2007-08-01 3:05 ` Andrew Morton
@ 2007-08-01 3:14 ` Christoph Lameter
2007-08-01 3:32 ` Andrew Morton
2007-08-01 15:58 ` [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes Nishanth Aravamudan
2007-08-01 15:25 ` Nishanth Aravamudan
1 sibling, 2 replies; 68+ messages in thread
From: Christoph Lameter @ 2007-08-01 3:14 UTC (permalink / raw)
To: Andrew Morton
Cc: Lee Schermerhorn, linux-mm, ak, Nishanth Aravamudan, pj, kxr,
Mel Gorman, KAMEZAWA Hiroyuki
On Tue, 31 Jul 2007, Andrew Morton wrote:
> ooookay... I don't think I want to be the first person who gets
> to do that, so I shall duck them for -mm2.
>
> I think there were updates pending anyway. I saw several under-replied-to
> patches from Lee but it wasn't clear it they were relevant to these changes
> or what.
I have not seen those. We also have the issue with slab allocations
failing on NUMAQ with its HIGHMEM zones.
Andi wants to drop support for NUMAQ again. Is that possible? NUMA only on
64 bit?
I have checked the current patchset and the fix into a git archive.
Those interested in working on this can do a
git pull git://git.kernel.org/pub/scm/linux/kernel/git/christoph/numa.git memoryless_nodes
to get the current patchset (This is a bit rough. Sorry Lee the attribution is screwed
up but we will fix this once I get the hang of it).
* Re: [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes
2007-08-01 3:14 ` Christoph Lameter
@ 2007-08-01 3:32 ` Andrew Morton
2007-08-01 3:37 ` Christoph Lameter
` (4 more replies)
2007-08-01 15:58 ` [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes Nishanth Aravamudan
1 sibling, 5 replies; 68+ messages in thread
From: Andrew Morton @ 2007-08-01 3:32 UTC (permalink / raw)
To: Christoph Lameter
Cc: Lee Schermerhorn, linux-mm, ak, Nishanth Aravamudan, pj, kxr,
Mel Gorman, KAMEZAWA Hiroyuki
On Tue, 31 Jul 2007 20:14:08 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
> On Tue, 31 Jul 2007, Andrew Morton wrote:
>
> > ooookay... I don't think I want to be the first person who gets
> > to do that, so I shall duck them for -mm2.
> >
> > I think there were updates pending anyway. I saw several under-replied-to
> > patches from Lee but it wasn't clear it they were relevant to these changes
> > or what.
>
> I have not seen those. We also have the issue with slab allocations
> failing on NUMAQ with its HIGHMEM zones.
>
> Andi wants to drop support for NUMAQ again. Is that possible? NUMA only on
> 64 bit?
umm, that would need wide circulation. I have a feeling that some
implementations of some of the more obscure 32-bit architectures can (or
will) have numa characteristics. Looks like mips might already.
And doesn't i386 summit do numa?
We could do it, but it would take some chin-scratching. It'd be good if we
could pull it off.
* Re: [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes
2007-08-01 3:32 ` Andrew Morton
@ 2007-08-01 3:37 ` Christoph Lameter
[not found] ` <Pine.LNX.4.64.0707312151400.2894@schroedinger.engr.sgi.com>
` (3 subsequent siblings)
4 siblings, 0 replies; 68+ messages in thread
From: Christoph Lameter @ 2007-08-01 3:37 UTC (permalink / raw)
To: Andrew Morton
Cc: Lee Schermerhorn, linux-mm, ak, Nishanth Aravamudan, pj, kxr,
Mel Gorman, KAMEZAWA Hiroyuki
On Tue, 31 Jul 2007, Andrew Morton wrote:
> > Andi wants to drop support for NUMAQ again. Is that possible? NUMA only on
> > 64 bit?
>
> umm, that would need wide circulation. I have a feeling that some
> implementations of some of the more obscure 32-bit architectures can (or
> will) have numa characteristics. Looks like mips might already.
>
> And doesn't i386 summit do numa?
>
> We could do it, but it would take some chin-scratching. It'd be good if we
> could pull it off.
Ok then we need to support highmem only nodes.
New flag:
N_HIGHMEMORY
N_HIGHMEMORY means any memory. N_MEMORY means normal memory.
slab etc needs to use N_MEMORY.
pagecache / memory policies can use N_HIGHMEMORY
Or do we want N_SLAB so that we can control which nodes are used by the
slab allocators?
The effect of memory policies will vary depending on where normal memory
is available.
* Re: [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes
[not found] ` <Pine.LNX.4.64.0707312151400.2894@schroedinger.engr.sgi.com>
@ 2007-08-01 5:07 ` Andrew Morton
2007-08-01 5:11 ` Andrew Morton
2007-08-01 5:22 ` Christoph Lameter
0 siblings, 2 replies; 68+ messages in thread
From: Andrew Morton @ 2007-08-01 5:07 UTC (permalink / raw)
To: Christoph Lameter
Cc: Lee Schermerhorn, linux-mm, ak, Nishanth Aravamudan, pj, kxr,
Mel Gorman, KAMEZAWA Hiroyuki
On Tue, 31 Jul 2007 21:55:41 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
> Anyone have a 32 bit NUMA system for testing this out?
>
test.kernel.org has a NUMAQ
>
> Available from the git tree at
>
> git://git.kernel.org/pub/scm/linux/kernel/git/christoph/slab.git memoryless_nodes
Please send 'em against rc1-mm2 (hopefully an hour away, if x86_64 box #2
works) (after runtime testing CONFIG_NUMA=n, please) and I can add them to next -mm
for test.k.o to look at.
* Re: [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes
2007-08-01 5:07 ` Andrew Morton
@ 2007-08-01 5:11 ` Andrew Morton
2007-08-01 5:22 ` Christoph Lameter
1 sibling, 0 replies; 68+ messages in thread
From: Andrew Morton @ 2007-08-01 5:11 UTC (permalink / raw)
To: Christoph Lameter, Lee Schermerhorn, linux-mm, ak,
Nishanth Aravamudan, pj, kxr, Mel Gorman, KAMEZAWA Hiroyuki
On Tue, 31 Jul 2007 22:07:27 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:
> Please send 'em against rc1-mm2 (hopefully an hour away, if x86_64 box #2
> works)
I spoke too soon. swsusp meets Vaio, blood on floor.
* Re: [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes
2007-08-01 5:07 ` Andrew Morton
2007-08-01 5:11 ` Andrew Morton
@ 2007-08-01 5:22 ` Christoph Lameter
2007-08-01 10:24 ` Mel Gorman
2007-08-02 16:23 ` Mel Gorman
1 sibling, 2 replies; 68+ messages in thread
From: Christoph Lameter @ 2007-08-01 5:22 UTC (permalink / raw)
To: Andrew Morton
Cc: Lee Schermerhorn, linux-mm, ak, Nishanth Aravamudan, pj, kxr,
Mel Gorman, KAMEZAWA Hiroyuki
On Tue, 31 Jul 2007, Andrew Morton wrote:
> > Anyone have a 32 bit NUMA system for testing this out?
> test.kernel.org has a NUMAQ
Ok someone do this please. SGI still has IA64 issues that need fixing
after the merge (nothing works on SN2 it seems) and that takes precedence.
* Re: [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes
2007-08-01 3:32 ` Andrew Morton
2007-08-01 3:37 ` Christoph Lameter
[not found] ` <Pine.LNX.4.64.0707312151400.2894@schroedinger.engr.sgi.com>
@ 2007-08-01 5:36 ` Paul Mundt
2007-08-01 9:19 ` Andi Kleen
2007-08-01 14:03 ` Lee Schermerhorn
4 siblings, 0 replies; 68+ messages in thread
From: Paul Mundt @ 2007-08-01 5:36 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Lameter, Lee Schermerhorn, linux-mm, ak,
Nishanth Aravamudan, pj, kxr, Mel Gorman, KAMEZAWA Hiroyuki
On Tue, Jul 31, 2007 at 08:32:03PM -0700, Andrew Morton wrote:
> On Tue, 31 Jul 2007 20:14:08 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
> > Andi wants to drop support for NUMAQ again. Is that possible? NUMA only on
> > 64 bit?
>
> umm, that would need wide circulation. I have a feeling that some
> implementations of some of the more obscure 32-bit architectures can (or
> will) have numa characteristics. Looks like mips might already.
>
> And doesn't i386 summit do numa?
>
> We could do it, but it would take some chin-scratching. It'd be good if we
> could pull it off.
>
No, SH also requires this due to the abundance of multiple memories with
varying costs, both in UP and SMP configurations. This was the motivation
behind SLOB + NUMA and the mempolicy work.
In the SMP case we have 4 CPUs and system memory + 5 SRAM blocks,
those blocks not only have differing access costs, there are also
implications for bus and cache controller contention. This works out to
6 nodes in practice, as each one has a differing cost.
More and more embedded processors are shipping with both on-chip and
external SRAM blocks in increasingly larger sizes (from 128kB - 1MB
on-chip, and more shared between CPUs). These often have special
characteristics, like bypassing the cache completely, so it's possible to
map workloads with certain latency constraints there while alleviating
pressure from the snoop controller. Some folks also opt for the SRAM
instead of an L2 due to die constraints, for example. In any event,
current processors make this sort of thing quite common, and I expect
there will be more embedded CPUs with blocks of memory they can't really
do a damn thing with otherwise besides statically allocating it all for a
single application.
As of -rc1, using SLOB on a 128kB SRAM node, I'm left with 124kB usable.
Since we give up a node-local pfn for the pgdat, this is what's expected.
There's still some work to be done in this area, but the current scheme
works well enough. If anything, we should be looking at ways to make it
more light-weight, rather than simply trying to push it all off.
I would expect other embedded platforms with similar use cases to start
adding support as well in the future.
* Re: [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes
2007-08-01 3:32 ` Andrew Morton
` (2 preceding siblings ...)
2007-08-01 5:36 ` Paul Mundt
@ 2007-08-01 9:19 ` Andi Kleen
2007-08-01 14:03 ` Lee Schermerhorn
4 siblings, 0 replies; 68+ messages in thread
From: Andi Kleen @ 2007-08-01 9:19 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Lameter, Lee Schermerhorn, linux-mm,
Nishanth Aravamudan, pj, kxr, Mel Gorman, KAMEZAWA Hiroyuki
On Wednesday 01 August 2007 05:32:03 Andrew Morton wrote:
> On Tue, 31 Jul 2007 20:14:08 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
>
> > On Tue, 31 Jul 2007, Andrew Morton wrote:
> >
> > > ooookay... I don't think I want to be the first person who gets
> > > to do that, so I shall duck them for -mm2.
> > >
> > > I think there were updates pending anyway. I saw several under-replied-to
> > > patches from Lee but it wasn't clear it they were relevant to these changes
> > > or what.
> >
> > I have not seen those. We also have the issue with slab allocations
> > failing on NUMAQ with its HIGHMEM zones.
> >
> > Andi wants to drop support for NUMAQ again. Is that possible? NUMA only on
> > 64 bit?
>
> umm, that would need wide circulation. I have a feeling that some
> implementations of some of the more obscure 32-bit architectures can (or
> will) have numa characteristics. Looks like mips might already.
The problem here is really highmem and NUMA. If they only have lowmem
i guess it would be reasonably easy to support.
> And doesn't i386 summit do numa?
Yes, it does. But I don't think many are run in NUMA mode.
-Andi
* Re: [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes
2007-08-01 5:22 ` Christoph Lameter
@ 2007-08-01 10:24 ` Mel Gorman
2007-08-02 16:23 ` Mel Gorman
1 sibling, 0 replies; 68+ messages in thread
From: Mel Gorman @ 2007-08-01 10:24 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrew Morton, Lee Schermerhorn, linux-mm, ak,
Nishanth Aravamudan, pj, kxr, KAMEZAWA Hiroyuki
On (31/07/07 22:22), Christoph Lameter didst pronounce:
> On Tue, 31 Jul 2007, Andrew Morton wrote:
>
> > > Anyone have a 32 bit NUMA system for testing this out?
> > test.kernel.org has a NUMAQ
>
> Ok someone do this please. SGI still has IA64 issues that need fixing
> after the merge (nothing works on SN2 it seems) and that takes precedence.
>
I've queued up what was in the numa.git tree for a number of machines
including elm3b132 and elm3b133 on test.kernel.org against 2.6.23-rc1-mm2. With
the release of -mm though there is a long queue so it'll be several hours
before I have any results; tomorrow if it does not go smoothly.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes
2007-08-01 3:32 ` Andrew Morton
` (3 preceding siblings ...)
2007-08-01 9:19 ` Andi Kleen
@ 2007-08-01 14:03 ` Lee Schermerhorn
2007-08-01 17:41 ` Christoph Lameter
4 siblings, 1 reply; 68+ messages in thread
From: Lee Schermerhorn @ 2007-08-01 14:03 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Lameter, linux-mm, ak, Nishanth Aravamudan, pj, kxr,
Mel Gorman, KAMEZAWA Hiroyuki
On Tue, 2007-07-31 at 20:32 -0700, Andrew Morton wrote:
> On Tue, 31 Jul 2007 20:14:08 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
>
> > On Tue, 31 Jul 2007, Andrew Morton wrote:
> >
> > > ooookay... I don't think I want to be the first person who gets
> > > to do that, so I shall duck them for -mm2.
Sorry about not testing on i386 and such. My only i386 is my laptop--my
"window to the world"--and I tend not to run experimental/development
kernels on it. [I know, such little faith :-(]. I suppose I could
reconfigure an X86_64 system in the lab with hardware interleaved memory
and try a 32-bit kernel there. I'll add that to my [ever growing] list
of things to explore...
> > >
> > > I think there were updates pending anyway. I saw several under-replied-to
> > > patches from Lee but it wasn't clear it they were relevant to these changes
> > > or what.
> >
> > I have not seen those. We also have the issue with slab allocations
> > failing on NUMAQ with its HIGHMEM zones.
> >
I think Andrew is referring to the "exclude selected nodes from
interleave policy" and "preferred policy fixups" patches. Those are
related to the memoryless node patches in the sense that they touch some
of the same lines in mempolicy.c. However, IMO, those patches shouldn't
gate the memoryless node series once the i386 issues are resolved.
Lee
* Re: [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes
2007-08-01 3:05 ` Andrew Morton
2007-08-01 3:14 ` Christoph Lameter
@ 2007-08-01 15:25 ` Nishanth Aravamudan
1 sibling, 0 replies; 68+ messages in thread
From: Nishanth Aravamudan @ 2007-08-01 15:25 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Lameter, Lee Schermerhorn, linux-mm, ak, pj, kxr,
Mel Gorman, KAMEZAWA Hiroyuki
On 31.07.2007 [20:05:22 -0700], Andrew Morton wrote:
> On Tue, 31 Jul 2007 19:52:23 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
>
> > On Tue, 31 Jul 2007, Andrew Morton wrote:
> >
> > > >
> > > > +#define for_each_node_state(node, __state) \
> > > > + for ( (node) = 0; (node) != 0; (node) = 1)
> > >
> > > That looks weird.
> >
> > Yup and we have committed the usual sin of not testing !NUMA.
>
> ooookay... I don't think I want to be the first person who gets
> to do that, so I shall duck them for -mm2.
I'm testing these patches (since they gate my stack of hugetlb
fixes/additions) on:
x86 !NUMA, x86 NUMA, x86_64 !NUMA, x86_64 NUMA, ppc64 !NUMA, ppc64 NUMA
and ia64 NUMA.
I already reported the issue you saw, but hadn't had time to look into
it yet; and also reported the NUMAQ issue which prompted the discussion
of 32-bit NUMA removal.
I'll keep doing that testing and reporting the results.
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes
2007-08-01 3:14 ` Christoph Lameter
2007-08-01 3:32 ` Andrew Morton
@ 2007-08-01 15:58 ` Nishanth Aravamudan
2007-08-01 16:09 ` Nishanth Aravamudan
2007-08-01 17:47 ` Christoph Lameter
1 sibling, 2 replies; 68+ messages in thread
From: Nishanth Aravamudan @ 2007-08-01 15:58 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrew Morton, Lee Schermerhorn, linux-mm, ak, pj, kxr,
Mel Gorman, KAMEZAWA Hiroyuki
On 31.07.2007 [20:14:08 -0700], Christoph Lameter wrote:
> On Tue, 31 Jul 2007, Andrew Morton wrote:
>
> > ooookay... I don't think I want to be the first person who gets
> > to do that, so I shall duck them for -mm2.
> >
> > I think there were updates pending anyway. I saw several under-replied-to
> > patches from Lee but it wasn't clear it they were relevant to these changes
> > or what.
>
> I have not seen those. We also have the issue with slab allocations
> failing on NUMAQ with its HIGHMEM zones.
>
> Andi wants to drop support for NUMAQ again. Is that possible? NUMA only on
> 64 bit?
>
> I have checked the current patchset and the fix into a git archive.
> Those interested in working on this can do a
>
> git pull git://git.kernel.org/pub/scm/linux/kernel/git/christoph/numa.git memoryless_nodes
>
> to get the current patchset (This is a bit rough. Sorry Lee the attribution is screwed
> up but we will fix this once I get the hang of it).
Are you sure this is up to date? According to gitweb, the last commit was July
22... And I don't see a 'memoryless_nodes' ref in `git peek-remote`.
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes
2007-08-01 15:58 ` [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes Nishanth Aravamudan
@ 2007-08-01 16:09 ` Nishanth Aravamudan
2007-08-01 17:47 ` Christoph Lameter
1 sibling, 0 replies; 68+ messages in thread
From: Nishanth Aravamudan @ 2007-08-01 16:09 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrew Morton, Lee Schermerhorn, linux-mm, ak, pj, kxr,
Mel Gorman, KAMEZAWA Hiroyuki
On 01.08.2007 [08:58:03 -0700], Nishanth Aravamudan wrote:
> On 31.07.2007 [20:14:08 -0700], Christoph Lameter wrote:
> > On Tue, 31 Jul 2007, Andrew Morton wrote:
> >
> > > ooookay... I don't think I want to be the first person who gets
> > > to do that, so I shall duck them for -mm2.
> > >
> > > I think there were updates pending anyway. I saw several under-replied-to
> > > patches from Lee but it wasn't clear if they were relevant to these changes
> > > or what.
> >
> > I have not seen those. We also have the issue with slab allocations
> > failing on NUMAQ with its HIGHMEM zones.
> >
> > Andi wants to drop support for NUMAQ again. Is that possible? NUMA only on
> > 64 bit?
> >
> > I have checked the current patchset and the fix into a git archive.
> > Those interested in working on this can do a
> >
> > git pull git://git.kernel.org/pub/scm/linux/kernel/git/christoph/numa.git memoryless_nodes
> >
> > to get the current patchset (This is a bit rough. Sorry Lee the attribution is screwed
> > up but we will fix this once I get the hang of it).
>
> Are you sure this is up to date? According to gitweb, the last commit was July
> 22... And I don't see a 'memoryless_nodes' ref in `git peek-remote`.
Bah, sorry, I copied the slab.git that was typo'd elsewhere in this
thread. Indeed, numa.git has the right stuff.
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes
2007-08-01 14:03 ` Lee Schermerhorn
@ 2007-08-01 17:41 ` Christoph Lameter
2007-08-01 17:54 ` Lee Schermerhorn
` (2 more replies)
0 siblings, 3 replies; 68+ messages in thread
From: Christoph Lameter @ 2007-08-01 17:41 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: Andrew Morton, linux-mm, ak, Nishanth Aravamudan, pj, kxr,
Mel Gorman, KAMEZAWA Hiroyuki
On Wed, 1 Aug 2007, Lee Schermerhorn wrote:
> I think Andrew is referring to the "exclude selected nodes from
> interleave policy" and "preferred policy fixups" patches. Those are
> related to the memoryless node patches in the sense that they touch some
> of the same lines in mempolicy.c. However, IMO, those patches shouldn't
> gate the memoryless node series once the i386 issues are resolved.
Right. I think we first need to get the basic set straight. To be
complete, we need to audit all uses of node_online() in the kernel and
think about each one. They may require either N_NORMAL_MEMORY or
N_HIGH_MEMORY, depending on whether the check is for a page cache
allocation or a kernel allocation.
Then we need to test on esoteric NUMA systems like NUMAQ and embedded
platforms.
On the way we may add some additional stuff like interleave policy
settings, restricting node use for huge pages, slab, etc.
All of these are likely going to be important for asymmetric NUMA
configurations that the memoryless_nodes patchset is going to address.
* Re: [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes
2007-08-01 15:58 ` [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes Nishanth Aravamudan
2007-08-01 16:09 ` Nishanth Aravamudan
@ 2007-08-01 17:47 ` Christoph Lameter
1 sibling, 0 replies; 68+ messages in thread
From: Christoph Lameter @ 2007-08-01 17:47 UTC (permalink / raw)
To: Nishanth Aravamudan
Cc: Andrew Morton, Lee Schermerhorn, linux-mm, ak, pj, kxr,
Mel Gorman, KAMEZAWA Hiroyuki
On Wed, 1 Aug 2007, Nishanth Aravamudan wrote:
> > I have checked the current patchset and the fix into a git archive.
> > Those interested in working on this can do a
> >
> > git pull git://git.kernel.org/pub/scm/linux/kernel/git/christoph/numa.git memoryless_nodes
> >
> > to get the current patchset (This is a bit rough. Sorry Lee the attribution is screwed
> > up but we will fix this once I get the hang of it).
>
> Are you sure this is up to date? According to gitweb, the last commit was July
> 22... And I don't see a 'memoryless_nodes' ref in `git peek-remote`.
You need to look at the memoryless_nodes branch, not master.
* Re: [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes
2007-08-01 17:41 ` Christoph Lameter
@ 2007-08-01 17:54 ` Lee Schermerhorn
2007-08-02 20:05 ` [PATCH/RFC/WIP] cpuset-independent interleave policy Lee Schermerhorn
2007-08-02 20:19 ` Audit of "all uses of node_online()" Lee Schermerhorn
2 siblings, 0 replies; 68+ messages in thread
From: Lee Schermerhorn @ 2007-08-01 17:54 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrew Morton, linux-mm, ak, Nishanth Aravamudan, pj, kxr,
Mel Gorman, KAMEZAWA Hiroyuki
On Wed, 2007-08-01 at 10:41 -0700, Christoph Lameter wrote:
> On Wed, 1 Aug 2007, Lee Schermerhorn wrote:
>
> > I think Andrew is referring to the "exclude selected nodes from
> > interleave policy" and "preferred policy fixups" patches. Those are
> > related to the memoryless node patches in the sense that they touch some
> > of the same lines in mempolicy.c. However, IMO, those patches shouldn't
> > gate the memoryless node series once the i386 issues are resolved.
>
> Right. I think we first need to get the basic set straight. To be
> complete, we need to audit all uses of node_online() in the kernel and
> think about each one. They may require either N_NORMAL_MEMORY or
> N_HIGH_MEMORY, depending on whether the check is for a page cache
> allocation or a kernel allocation.
>
> Then we need to test on esoteric NUMA systems like NUMAQ and embedded platforms.
And HP's ia64 platform.
>
> On the way we may add some additional stuff like interleave policy
> settings, restricting node use for huge pages, slab, etc.
> All of these are likely going to be important for asymmetric NUMA
> configurations that the memoryless_nodes patchset is going to address.
>
Agree. The memoryless nodes set makes it fairly easy to add these
restrictions, I think, by using the new node_states[] support. I just
want to keep the discussion going, as our asymmetric platform needs this
support for policies to work as desired.
Later,
Lee
* Re: [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes
2007-08-01 5:22 ` Christoph Lameter
2007-08-01 10:24 ` Mel Gorman
@ 2007-08-02 16:23 ` Mel Gorman
2007-08-02 20:00 ` Christoph Lameter
1 sibling, 1 reply; 68+ messages in thread
From: Mel Gorman @ 2007-08-02 16:23 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrew Morton, Lee Schermerhorn, linux-mm, ak,
Nishanth Aravamudan, pj, kxr, KAMEZAWA Hiroyuki
On (31/07/07 22:22), Christoph Lameter didst pronounce:
> On Tue, 31 Jul 2007, Andrew Morton wrote:
>
> > > Anyone have a 32 bit NUMA system for testing this out?
> > test.kernel.org has a NUMAQ
>
> Ok someone do this please. SGI still has IA64 issues that need fixing
> after the merge (nothing works on SN2 it seems) and that takes precedence.
>
With the pci_create_bus() issue fixed up, I was able to boot on numaq
with the patch from your git tree applied. It survived running kernbench,
tbench and hackbench. Nish is looking more closely than I am, just to be sure.
For reference, the patch I tested on top of 2.6.23-rc1-mm2 with the pci
problem fixed up is below:
diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt
index f2c0a68..b875d23 100644
--- a/Documentation/cpusets.txt
+++ b/Documentation/cpusets.txt
@@ -35,7 +35,8 @@ CONTENTS:
----------------------
Cpusets provide a mechanism for assigning a set of CPUs and Memory
-Nodes to a set of tasks.
+Nodes to a set of tasks. In this document "Memory Node" refers to
+an on-line node that contains memory.
Cpusets constrain the CPU and Memory placement of tasks to only
the resources within a tasks current cpuset. They form a nested
@@ -220,8 +221,8 @@ and name space for cpusets, with a minimum of additional kernel code.
The cpus and mems files in the root (top_cpuset) cpuset are
read-only. The cpus file automatically tracks the value of
cpu_online_map using a CPU hotplug notifier, and the mems file
-automatically tracks the value of node_online_map using the
-cpuset_track_online_nodes() hook.
+automatically tracks the value of node_states[N_MEMORY]--i.e.,
+nodes with memory--using the cpuset_track_online_nodes() hook.
1.4 What are exclusive cpusets ?
diff --git a/arch/ia64/kernel/uncached.c b/arch/ia64/kernel/uncached.c
index c58e933..a7be4f2 100644
--- a/arch/ia64/kernel/uncached.c
+++ b/arch/ia64/kernel/uncached.c
@@ -196,7 +196,7 @@ unsigned long uncached_alloc_page(int starting_nid)
nid = starting_nid;
do {
- if (!node_online(nid))
+ if (!node_state(nid, N_HIGH_MEMORY))
continue;
uc_pool = &uncached_pools[nid];
if (uc_pool->pool == NULL)
@@ -268,7 +268,7 @@ static int __init uncached_init(void)
{
int nid;
- for_each_online_node(nid) {
+ for_each_node_state(nid, N_ONLINE) {
uncached_pools[nid].pool = gen_pool_create(PAGE_SHIFT, nid);
mutex_init(&uncached_pools[nid].add_chunk_mutex);
}
diff --git a/drivers/char/mspec.c b/drivers/char/mspec.c
index c08a415..862747c 100644
--- a/drivers/char/mspec.c
+++ b/drivers/char/mspec.c
@@ -345,7 +345,7 @@ mspec_init(void)
is_sn2 = 1;
if (is_shub2()) {
ret = -ENOMEM;
- for_each_online_node(nid) {
+ for_each_node_state(nid, N_ONLINE) {
int actual_nid;
int nasid;
unsigned long phys;
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 826b15e..9e633ea 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -93,7 +93,7 @@ static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
return node_possible_map;
}
-#define cpuset_current_mems_allowed (node_online_map)
+#define cpuset_current_mems_allowed (node_states[N_HIGH_MEMORY])
static inline void cpuset_init_current_mems_allowed(void) {}
static inline void cpuset_update_task_memory_state(void) {}
#define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index bc68dd9..12a90a1 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -98,22 +98,29 @@ struct vm_area_struct;
static inline enum zone_type gfp_zone(gfp_t flags)
{
+ int base = 0;
+
+#ifdef CONFIG_NUMA
+ if (flags & __GFP_THISNODE)
+ base = MAX_NR_ZONES;
+#endif
+
#ifdef CONFIG_ZONE_DMA
if (flags & __GFP_DMA)
- return ZONE_DMA;
+ return base + ZONE_DMA;
#endif
#ifdef CONFIG_ZONE_DMA32
if (flags & __GFP_DMA32)
- return ZONE_DMA32;
+ return base + ZONE_DMA32;
#endif
if ((flags & (__GFP_HIGHMEM | __GFP_MOVABLE)) ==
(__GFP_HIGHMEM | __GFP_MOVABLE))
- return ZONE_MOVABLE;
+ return base + ZONE_MOVABLE;
#ifdef CONFIG_HIGHMEM
if (flags & __GFP_HIGHMEM)
- return ZONE_HIGHMEM;
+ return base + ZONE_HIGHMEM;
#endif
- return ZONE_NORMAL;
+ return base + ZONE_NORMAL;
}
/*
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3ea68cd..d20cabb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -324,6 +324,17 @@ struct zone {
#define MAX_ZONES_PER_ZONELIST (MAX_NUMNODES * MAX_NR_ZONES)
#ifdef CONFIG_NUMA
+
+/*
+ * The NUMA zonelists are doubled because we need zonelists that restrict the
+ * allocations to a single node for GFP_THISNODE.
+ *
+ * [0 .. MAX_NR_ZONES -1] : Zonelists with fallback
+ * [MAX_NR_ZONES ... MAX_ZONELISTS -1] : No fallback (GFP_THISNODE)
+ */
+#define MAX_ZONELISTS (2 * MAX_NR_ZONES)
+
+
/*
* We cache key information from each zonelist for smaller cache
* footprint when scanning for free pages in get_page_from_freelist().
@@ -389,6 +400,7 @@ struct zonelist_cache {
unsigned long last_full_zap; /* when last zap'd (jiffies) */
};
#else
+#define MAX_ZONELISTS MAX_NR_ZONES
struct zonelist_cache;
#endif
@@ -437,7 +449,7 @@ extern struct page *mem_map;
struct bootmem_data;
typedef struct pglist_data {
struct zone node_zones[MAX_NR_ZONES];
- struct zonelist node_zonelists[MAX_NR_ZONES];
+ struct zonelist node_zonelists[MAX_ZONELISTS];
int nr_zones;
#ifdef CONFIG_FLAT_NODE_MEM_MAP
struct page *node_mem_map;
diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 52c54a5..1145f33 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -338,31 +338,84 @@ static inline void __nodes_remap(nodemask_t *dstp, const nodemask_t *srcp,
#endif /* MAX_NUMNODES */
/*
+ * Bitmasks that are kept for all the nodes.
+ */
+enum node_states {
+ N_POSSIBLE, /* The node could become online at some point */
+ N_ONLINE, /* The node is online */
+ N_NORMAL_MEMORY, /* The node has regular memory */
+ N_HIGH_MEMORY, /* The node has regular or high memory */
+ N_CPU, /* The node has one or more cpus */
+ NR_NODE_STATES
+};
+
+/*
* The following particular system nodemasks and operations
* on them manage all possible and online nodes.
*/
-extern nodemask_t node_online_map;
-extern nodemask_t node_possible_map;
+extern nodemask_t node_states[NR_NODE_STATES];
#if MAX_NUMNODES > 1
-#define num_online_nodes() nodes_weight(node_online_map)
-#define num_possible_nodes() nodes_weight(node_possible_map)
-#define node_online(node) node_isset((node), node_online_map)
-#define node_possible(node) node_isset((node), node_possible_map)
-#define first_online_node first_node(node_online_map)
-#define next_online_node(nid) next_node((nid), node_online_map)
+static inline int node_state(int node, enum node_states state)
+{
+ return node_isset(node, node_states[state]);
+}
+
+static inline void node_set_state(int node, enum node_states state)
+{
+ __node_set(node, &node_states[state]);
+}
+
+static inline void node_clear_state(int node, enum node_states state)
+{
+ __node_clear(node, &node_states[state]);
+}
+
+static inline int num_node_state(enum node_states state)
+{
+ return nodes_weight(node_states[state]);
+}
+
+#define for_each_node_state(__node, __state) \
+ for_each_node_mask((__node), node_states[__state])
+
+#define first_online_node first_node(node_states[N_ONLINE])
+#define next_online_node(nid) next_node((nid), node_states[N_ONLINE])
+
extern int nr_node_ids;
#else
-#define num_online_nodes() 1
-#define num_possible_nodes() 1
-#define node_online(node) ((node) == 0)
-#define node_possible(node) ((node) == 0)
+
+static inline int node_state(int node, enum node_states state)
+{
+ return node == 0;
+}
+
+static inline void node_set_state(int node, enum node_states state)
+{
+}
+
+static inline void node_clear_state(int node, enum node_states state)
+{
+}
+
+static inline int num_node_state(enum node_states state)
+{
+ return 1;
+}
+
+#define for_each_node_state(node, __state) \
+ for ( (node) = 0; (node) == 0; (node) = 1)
+
#define first_online_node 0
#define next_online_node(nid) (MAX_NUMNODES)
#define nr_node_ids 1
+
#endif
+#define node_online_map node_states[N_ONLINE]
+#define node_possible_map node_states[N_POSSIBLE]
+
#define any_online_node(mask) \
({ \
int node; \
@@ -372,10 +425,15 @@ extern int nr_node_ids;
node; \
})
-#define node_set_online(node) set_bit((node), node_online_map.bits)
-#define node_set_offline(node) clear_bit((node), node_online_map.bits)
+#define num_online_nodes() num_node_state(N_ONLINE)
+#define num_possible_nodes() num_node_state(N_POSSIBLE)
+#define node_online(node) node_state((node), N_ONLINE)
+#define node_possible(node) node_state((node), N_POSSIBLE)
+
+#define node_set_online(node) node_set_state((node), N_ONLINE)
+#define node_set_offline(node) node_clear_state((node), N_ONLINE)
-#define for_each_node(node) for_each_node_mask((node), node_possible_map)
-#define for_each_online_node(node) for_each_node_mask((node), node_online_map)
+#define for_each_node(node) for_each_node_state(node, N_POSSIBLE)
+#define for_each_online_node(node) for_each_node_state(node, N_ONLINE)
#endif /* __LINUX_NODEMASK_H */
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 57e6448..8b2daac 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -581,26 +581,28 @@ static void guarantee_online_cpus(const struct cpuset *cs, cpumask_t *pmask)
/*
* Return in *pmask the portion of a cpusets's mems_allowed that
- * are online. If none are online, walk up the cpuset hierarchy
- * until we find one that does have some online mems. If we get
- * all the way to the top and still haven't found any online mems,
- * return node_online_map.
+ * are online, with memory. If none are online with memory, walk
+ * up the cpuset hierarchy until we find one that does have some
+ * online mems. If we get all the way to the top and still haven't
+ * found any online mems, return node_states[N_HIGH_MEMORY].
*
* One way or another, we guarantee to return some non-empty subset
- * of node_online_map.
+ * of node_states[N_HIGH_MEMORY].
*
* Call with callback_mutex held.
*/
static void guarantee_online_mems(const struct cpuset *cs, nodemask_t *pmask)
{
- while (cs && !nodes_intersects(cs->mems_allowed, node_online_map))
+ while (cs && !nodes_intersects(cs->mems_allowed,
+ node_states[N_HIGH_MEMORY]))
cs = cs->parent;
if (cs)
- nodes_and(*pmask, cs->mems_allowed, node_online_map);
+ nodes_and(*pmask, cs->mems_allowed,
+ node_states[N_HIGH_MEMORY]);
else
- *pmask = node_online_map;
- BUG_ON(!nodes_intersects(*pmask, node_online_map));
+ *pmask = node_states[N_HIGH_MEMORY];
+ BUG_ON(!nodes_intersects(*pmask, node_states[N_HIGH_MEMORY]));
}
/**
@@ -924,7 +926,10 @@ static int update_nodemask(struct cpuset *cs, char *buf)
int fudge;
int retval;
- /* top_cpuset.mems_allowed tracks node_online_map; it's read-only */
+ /*
+ * top_cpuset.mems_allowed tracks node_states[N_HIGH_MEMORY];
+ * it's read-only
+ */
if (cs == &top_cpuset)
return -EACCES;
@@ -941,8 +946,21 @@ static int update_nodemask(struct cpuset *cs, char *buf)
retval = nodelist_parse(buf, trialcs.mems_allowed);
if (retval < 0)
goto done;
+ if (!nodes_intersects(trialcs.mems_allowed,
+ node_states[N_HIGH_MEMORY])) {
+ /*
+ * error if only memoryless nodes specified.
+ */
+ retval = -ENOSPC;
+ goto done;
+ }
}
- nodes_and(trialcs.mems_allowed, trialcs.mems_allowed, node_online_map);
+ /*
+ * Exclude memoryless nodes. We know that trialcs.mems_allowed
+ * contains at least one node with memory.
+ */
+ nodes_and(trialcs.mems_allowed, trialcs.mems_allowed,
+ node_states[N_HIGH_MEMORY]);
oldmem = cs->mems_allowed;
if (nodes_equal(oldmem, trialcs.mems_allowed)) {
retval = 0; /* Too easy - nothing to do */
@@ -2098,8 +2116,9 @@ static void guarantee_online_cpus_mems_in_subtree(const struct cpuset *cur)
/*
* The cpus_allowed and mems_allowed nodemasks in the top_cpuset track
- * cpu_online_map and node_online_map. Force the top cpuset to track
- * whats online after any CPU or memory node hotplug or unplug event.
+ * cpu_online_map and node_states[N_HIGH_MEMORY]. Force the top cpuset to
+ * track what's online after any CPU or memory node hotplug or unplug
+ * event.
*
* To ensure that we don't remove a CPU or node from the top cpuset
* that is currently in use by a child cpuset (which would violate
@@ -2119,7 +2138,7 @@ static void common_cpu_mem_hotplug_unplug(void)
guarantee_online_cpus_mems_in_subtree(&top_cpuset);
top_cpuset.cpus_allowed = cpu_online_map;
- top_cpuset.mems_allowed = node_online_map;
+ top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
mutex_unlock(&callback_mutex);
mutex_unlock(&manage_mutex);
@@ -2147,8 +2166,9 @@ static int cpuset_handle_cpuhp(struct notifier_block *nb,
#ifdef CONFIG_MEMORY_HOTPLUG
/*
- * Keep top_cpuset.mems_allowed tracking node_online_map.
- * Call this routine anytime after you change node_online_map.
+ * Keep top_cpuset.mems_allowed tracking node_states[N_HIGH_MEMORY].
+ * Call this routine anytime after you change
+ * node_states[N_HIGH_MEMORY].
* See also the previous routine cpuset_handle_cpuhp().
*/
@@ -2167,7 +2187,7 @@ void cpuset_track_online_nodes(void)
void __init cpuset_init_smp(void)
{
top_cpuset.cpus_allowed = cpu_online_map;
- top_cpuset.mems_allowed = node_online_map;
+ top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
hotcpu_notifier(cpuset_handle_cpuhp, 0);
}
@@ -2309,7 +2329,7 @@ void cpuset_init_current_mems_allowed(void)
*
* Description: Returns the nodemask_t mems_allowed of the cpuset
* attached to the specified @tsk. Guaranteed to return some non-empty
- * subset of node_online_map, even if this means going outside the
+ * subset of node_states[N_HIGH_MEMORY], even if this means going outside the
* tasks cpuset.
**/
diff --git a/kernel/profile.c b/kernel/profile.c
index 5b20fe9..ed407f5 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -346,7 +346,7 @@ static int __devinit profile_cpu_callback(struct notifier_block *info,
per_cpu(cpu_profile_flip, cpu) = 0;
if (!per_cpu(cpu_profile_hits, cpu)[1]) {
page = alloc_pages_node(node,
- GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
+ GFP_KERNEL | __GFP_ZERO,
0);
if (!page)
return NOTIFY_BAD;
@@ -354,7 +354,7 @@ static int __devinit profile_cpu_callback(struct notifier_block *info,
}
if (!per_cpu(cpu_profile_hits, cpu)[0]) {
page = alloc_pages_node(node,
- GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
+ GFP_KERNEL | __GFP_ZERO,
0);
if (!page)
goto out_free;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 71b84b4..93957fe 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -185,7 +185,9 @@ static struct mempolicy *mpol_new(int mode, nodemask_t *nodes)
switch (mode) {
case MPOL_INTERLEAVE:
policy->v.nodes = *nodes;
- if (nodes_weight(*nodes) == 0) {
+ nodes_and(policy->v.nodes, policy->v.nodes,
+ node_states[N_HIGH_MEMORY]);
+ if (nodes_weight(policy->v.nodes) == 0) {
kmem_cache_free(policy_cache, policy);
return ERR_PTR(-EINVAL);
}
@@ -494,9 +496,9 @@ static void get_zonemask(struct mempolicy *p, nodemask_t *nodes)
*nodes = p->v.nodes;
break;
case MPOL_PREFERRED:
- /* or use current node instead of online map? */
+ /* or use current node instead of memory_map? */
if (p->v.preferred_node < 0)
- *nodes = node_online_map;
+ *nodes = node_states[N_HIGH_MEMORY];
else
node_set(p->v.preferred_node, *nodes);
break;
@@ -1617,7 +1619,7 @@ void __init numa_policy_init(void)
* fall back to the largest node if they're all smaller.
*/
nodes_clear(interleave_nodes);
- for_each_online_node(nid) {
+ for_each_node_state(nid, N_HIGH_MEMORY) {
unsigned long total_pages = node_present_pages(nid);
/* Preserve the largest node */
@@ -1897,7 +1899,7 @@ int show_numa_map(struct seq_file *m, void *v)
seq_printf(m, " huge");
} else {
check_pgd_range(vma, vma->vm_start, vma->vm_end,
- &node_online_map, MPOL_MF_STATS, md);
+ &node_states[N_HIGH_MEMORY], MPOL_MF_STATS, md);
}
if (!md->pages)
@@ -1924,7 +1926,7 @@ int show_numa_map(struct seq_file *m, void *v)
if (md->writeback)
seq_printf(m," writeback=%lu", md->writeback);
- for_each_online_node(n)
+ for_each_node_state(n, N_HIGH_MEMORY)
if (md->node[n])
seq_printf(m, " N%d=%lu", n, md->node[n]);
out:
diff --git a/mm/migrate.c b/mm/migrate.c
index 37c73b9..0e3e304 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -979,7 +979,7 @@ asmlinkage long sys_move_pages(pid_t pid, unsigned long nr_pages,
goto out;
err = -ENODEV;
- if (!node_online(node))
+ if (!node_state(node, N_HIGH_MEMORY))
goto out;
err = -EACCES;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f9b82ad..41b4e36 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -177,14 +177,7 @@ static inline int constrained_alloc(struct zonelist *zonelist, gfp_t gfp_mask)
{
#ifdef CONFIG_NUMA
struct zone **z;
- nodemask_t nodes;
- int node;
-
- nodes_clear(nodes);
- /* node has memory ? */
- for_each_online_node(node)
- if (NODE_DATA(node)->node_present_pages)
- node_set(node, nodes);
+ nodemask_t nodes = node_states[N_HIGH_MEMORY];
for (z = zonelist->zones; *z; z++)
if (cpuset_zone_allowed_softwall(*z, gfp_mask))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3da85b8..1d8e4c8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -47,13 +47,14 @@
#include "internal.h"
/*
- * MCD - HACK: Find somewhere to initialize this EARLY, or make this
- * initializer cleaner
+ * Array of node states.
*/
-nodemask_t node_online_map __read_mostly = { { [0] = 1UL } };
-EXPORT_SYMBOL(node_online_map);
-nodemask_t node_possible_map __read_mostly = NODE_MASK_ALL;
-EXPORT_SYMBOL(node_possible_map);
+nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
+ [N_POSSIBLE] = NODE_MASK_ALL,
+ [N_ONLINE] = { { [0] = 1UL } }
+};
+EXPORT_SYMBOL(node_states);
+
unsigned long totalram_pages __read_mostly;
unsigned long totalreserve_pages __read_mostly;
long nr_swap_pages;
@@ -1170,9 +1171,6 @@ zonelist_scan:
!zlc_zone_worth_trying(zonelist, z, allowednodes))
continue;
zone = *z;
- if (unlikely(NUMA_BUILD && (gfp_mask & __GFP_THISNODE) &&
- zone->zone_pgdat != zonelist->zones[0]->zone_pgdat))
- break;
if ((alloc_flags & ALLOC_CPUSET) &&
!cpuset_zone_allowed_softwall(zone, gfp_mask))
goto try_next_zone;
@@ -1241,7 +1239,10 @@ restart:
z = zonelist->zones; /* the list of zones suitable for gfp_mask */
if (unlikely(*z == NULL)) {
- /* Should this ever happen?? */
+ /*
+ * Happens if we have an empty zonelist as a result of
+ * GFP_THISNODE being used on a memoryless node
+ */
return NULL;
}
@@ -1837,6 +1838,22 @@ static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
}
/*
+ * Build gfp_thisnode zonelists
+ */
+static void build_thisnode_zonelists(pg_data_t *pgdat)
+{
+ enum zone_type i;
+ int j;
+ struct zonelist *zonelist;
+
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ zonelist = pgdat->node_zonelists + MAX_NR_ZONES + i;
+ j = build_zonelists_node(pgdat, zonelist, 0, i);
+ zonelist->zones[j] = NULL;
+ }
+}
+
+/*
* Build zonelists ordered by zone and nodes within zones.
* This results in conserving DMA zone[s] until all Normal memory is
* exhausted, but results in overflowing to remote node while memory
@@ -1940,7 +1957,7 @@ static void build_zonelists(pg_data_t *pgdat)
int order = current_zonelist_order;
/* initialize zonelists */
- for (i = 0; i < MAX_NR_ZONES; i++) {
+ for (i = 0; i < MAX_ZONELISTS; i++) {
zonelist = pgdat->node_zonelists + i;
zonelist->zones[0] = NULL;
}
@@ -1985,6 +2002,8 @@ static void build_zonelists(pg_data_t *pgdat)
/* calculate node order -- i.e., DMA last! */
build_zonelists_in_zone_order(pgdat, j);
}
+
+ build_thisnode_zonelists(pgdat);
}
/* Construct the zonelist performance cache - see further mmzone.h */
@@ -2063,10 +2082,23 @@ static void build_zonelist_cache(pg_data_t *pgdat)
static int __build_all_zonelists(void *dummy)
{
int nid;
+ enum zone_type zone;
for_each_online_node(nid) {
- build_zonelists(NODE_DATA(nid));
- build_zonelist_cache(NODE_DATA(nid));
+ pg_data_t *pgdat = NODE_DATA(nid);
+
+ build_zonelists(pgdat);
+ build_zonelist_cache(pgdat);
+
+ /* Any memory on that node */
+ if (pgdat->node_present_pages)
+ node_set_state(nid, N_HIGH_MEMORY);
+
+ /* Any regular memory on that node ? */
+ for (zone = 0; zone <= ZONE_NORMAL; zone++)
+ if (pgdat->node_zones[zone].present_pages)
+ node_set_state(nid, N_NORMAL_MEMORY);
+
}
return 0;
}
@@ -2311,6 +2343,7 @@ static struct per_cpu_pageset boot_pageset[NR_CPUS];
static int __cpuinit process_zones(int cpu)
{
struct zone *zone, *dzone;
+ int node = cpu_to_node(cpu);
for_each_zone(zone) {
@@ -2318,7 +2351,7 @@ static int __cpuinit process_zones(int cpu)
continue;
zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset),
- GFP_KERNEL, cpu_to_node(cpu));
+ GFP_KERNEL, node);
if (!zone_pcp(zone, cpu))
goto bad;
@@ -2329,6 +2362,7 @@ static int __cpuinit process_zones(int cpu)
(zone->present_pages / percpu_pagelist_fraction));
}
+ node_set_state(node, N_CPU);
return 0;
bad:
for_each_zone(dzone) {
@@ -2665,10 +2699,8 @@ void __meminit get_pfn_range_for_nid(unsigned int nid,
*end_pfn = max(*end_pfn, early_node_map[i].end_pfn);
}
- if (*start_pfn == -1UL) {
- printk(KERN_WARNING "Node %u active with no memory\n", nid);
+ if (*start_pfn == -1UL)
*start_pfn = 0;
- }
/* Push the node boundaries out if requested */
account_node_boundary(nid, start_pfn, end_pfn);
diff --git a/mm/slab.c b/mm/slab.c
index a684778..73adca9 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1565,7 +1565,7 @@ void __init kmem_cache_init(void)
/* Replace the static kmem_list3 structures for the boot cpu */
init_list(&cache_cache, &initkmem_list3[CACHE_CACHE], node);
- for_each_online_node(nid) {
+ for_each_node_state(nid, N_NORMAL_MEMORY) {
init_list(malloc_sizes[INDEX_AC].cs_cachep,
&initkmem_list3[SIZE_AC + nid], nid);
@@ -1941,7 +1941,7 @@ static void __init set_up_list3s(struct kmem_cache *cachep, int index)
{
int node;
- for_each_online_node(node) {
+ for_each_node_state(node, N_NORMAL_MEMORY) {
cachep->nodelists[node] = &initkmem_list3[index + node];
cachep->nodelists[node]->next_reap = jiffies +
REAPTIMEOUT_LIST3 +
@@ -2072,7 +2072,7 @@ static int __init_refok setup_cpu_cache(struct kmem_cache *cachep)
g_cpucache_up = PARTIAL_L3;
} else {
int node;
- for_each_online_node(node) {
+ for_each_node_state(node, N_NORMAL_MEMORY) {
cachep->nodelists[node] =
kmalloc_node(sizeof(struct kmem_list3),
GFP_KERNEL, node);
@@ -3782,7 +3782,7 @@ static int alloc_kmemlist(struct kmem_cache *cachep)
struct array_cache *new_shared;
struct array_cache **new_alien = NULL;
- for_each_online_node(node) {
+ for_each_node_state(node, N_NORMAL_MEMORY) {
if (use_alien_caches) {
new_alien = alloc_alien_cache(node, cachep->limit);
diff --git a/mm/slub.c b/mm/slub.c
index 6c6d74f..e5fe0a9 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1904,7 +1904,7 @@ static void free_kmem_cache_nodes(struct kmem_cache *s)
{
int node;
- for_each_online_node(node) {
+ for_each_node_state(node, N_NORMAL_MEMORY) {
struct kmem_cache_node *n = s->node[node];
if (n && n != &s->local_node)
kmem_cache_free(kmalloc_caches, n);
@@ -1922,7 +1922,7 @@ static int init_kmem_cache_nodes(struct kmem_cache *s, gfp_t gfpflags)
else
local_node = 0;
- for_each_online_node(node) {
+ for_each_node_state(node, N_NORMAL_MEMORY) {
struct kmem_cache_node *n;
if (local_node == node)
@@ -2176,7 +2176,7 @@ static inline int kmem_cache_close(struct kmem_cache *s)
flush_all(s);
/* Attempt to free all objects */
- for_each_online_node(node) {
+ for_each_node_state(node, N_NORMAL_MEMORY) {
struct kmem_cache_node *n = get_node(s, node);
n->nr_partial -= free_list(s, n, &n->partial);
@@ -2471,7 +2471,7 @@ int kmem_cache_shrink(struct kmem_cache *s)
return -ENOMEM;
flush_all(s);
- for_each_online_node(node) {
+ for_each_node_state(node, N_NORMAL_MEMORY) {
n = get_node(s, node);
if (!n->nr_partial)
@@ -2861,7 +2861,7 @@ static long validate_slab_cache(struct kmem_cache *s)
return -ENOMEM;
flush_all(s);
- for_each_online_node(node) {
+ for_each_node_state(node, N_NORMAL_MEMORY) {
struct kmem_cache_node *n = get_node(s, node);
count += validate_slab_node(s, n, map);
@@ -3081,7 +3081,7 @@ static int list_locations(struct kmem_cache *s, char *buf,
/* Push back cpu slabs */
flush_all(s);
- for_each_online_node(node) {
+ for_each_node_state(node, N_NORMAL_MEMORY) {
struct kmem_cache_node *n = get_node(s, node);
unsigned long flags;
struct page *page;
@@ -3208,7 +3208,7 @@ static unsigned long slab_objects(struct kmem_cache *s,
}
}
- for_each_online_node(node) {
+ for_each_node_state(node, N_NORMAL_MEMORY) {
struct kmem_cache_node *n = get_node(s, node);
if (flags & SO_PARTIAL) {
@@ -3236,7 +3236,7 @@ static unsigned long slab_objects(struct kmem_cache *s,
x = sprintf(buf, "%lu", total);
#ifdef CONFIG_NUMA
- for_each_online_node(node)
+ for_each_node_state(node, N_NORMAL_MEMORY)
if (nodes[node])
x += sprintf(buf + x, " N%d=%lu",
node, nodes[node]);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d419e10..f7fe92d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1674,7 +1674,7 @@ static int __init kswapd_init(void)
int nid;
swap_setup();
- for_each_online_node(nid)
+ for_each_node_state(nid, N_HIGH_MEMORY)
kswapd_run(nid);
hotcpu_notifier(cpu_callback, 0);
return 0;
@@ -1794,7 +1794,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
{
- cpumask_t mask;
int node_id;
/*
@@ -1831,8 +1830,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
* as wide as possible.
*/
node_id = zone_to_nid(zone);
- mask = node_to_cpumask(node_id);
- if (!cpus_empty(mask) && node_id != numa_node_id())
+ if (node_state(node_id, N_CPU) && node_id != numa_node_id())
return 0;
return __zone_reclaim(zone, gfp_mask, order);
}
--
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* Re: [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes
2007-08-02 16:23 ` Mel Gorman
@ 2007-08-02 20:00 ` Christoph Lameter
0 siblings, 0 replies; 68+ messages in thread
From: Christoph Lameter @ 2007-08-02 20:00 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Lee Schermerhorn, linux-mm, ak,
Nishanth Aravamudan, pj, kxr, KAMEZAWA Hiroyuki
On Thu, 2 Aug 2007, Mel Gorman wrote:
> With the pci_create_bus() issue fixed up, I was able to boot on numaq
> with the patch from your git tree applied. It survived running kernbench,
Thanks!
* [PATCH/RFC/WIP] cpuset-independent interleave policy
2007-08-01 17:41 ` Christoph Lameter
2007-08-01 17:54 ` Lee Schermerhorn
@ 2007-08-02 20:05 ` Lee Schermerhorn
2007-08-02 20:34 ` Christoph Lameter
2007-08-02 20:19 ` Audit of "all uses of node_online()" Lee Schermerhorn
2 siblings, 1 reply; 68+ messages in thread
From: Lee Schermerhorn @ 2007-08-02 20:05 UTC (permalink / raw)
To: linux-mm
Cc: Christoph Lameter, ak, Nishanth Aravamudan, pj, kxr, Mel Gorman,
KAMEZAWA Hiroyuki, Andrew Morton, Eric Whitney
Against: 2.6.23-rc1-mm2 atop memoryless node patches with my patch
to exclude selected nodes from interleave.
Work in Progress -- for discussion and comment
Interleave memory policy uses physical node ids. When
a task executes in a cpuset, any policies that it installs
are constrained to use only nodes that are valid in the cpuset.
This makes it difficult to use shared policies--e.g., on shmem/shm
segments--in this environment; especially in disjoint cpusets. Any
policy installed by a task in one of the cpusets is invalid in a
disjoint cpuset.
Local allocation, whether as a result of default policy or preferred
policy with the local preferred_node token [-1 internally, null/empty
nodemask in the APIs], does not suffer from this problem. It is a
"context dependent" or cpuset-independent policy.
This patch introduces a cpuset-independent interleave policy that will
work in shared policies applied to shared memory segments attached by
tasks in disjoint cpusets. The cpuset-independent policy effectively
says "interleave across all valid nodes in the context where page
allocation occurs."
API: following the lead of the "preferred local" policy, a null or
empty node mask specified with MPOL_INTERLEAVE specifies "all nodes
valid in the allocating context."
Internally, it's not quite as easy as storing a special token [node
id == -1] in the preferred_node member. MPOL_INTERLEAVE policy uses
a nodemask embedded in the mempolicy structure. The nodemask is
"unioned" with preferred_node. The only otherwise invalid value of
the nodemask that one could use to indicate the context-dependent interleave
mask is the empty set. Coding-wise this would be simple:
if (nodes_empty(mpol->v.nodes)) ...
However, this will involve testing possibly several words of
bitmask. Instead, I chose to encode the "context-dependent policy"
indication in the upper bits of the policy member of the mempolicy
structure. This member must already be tested to determine the
policy mode, so no extra memory references should be required.
However, for testing the policy--e.g., in the several switch()
and if() statements--the context flag must be masked off using the
policy_mode() inline function. On the upside, this allows additional
flags to be so encoded, should that become useful.
Another potential issue is that this requires fetching the interleave
nodemask--either from the mempolicy struct or cpuset_current_mems_allowed,
depending on the context flag, during page allocation time. However,
interleaving is already a fairly heavy-weight policy, so maybe this won't
be noticeable. I WILL take some performance data, "real soon now".
Functionally tested OK. i.e., it appears to work.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
include/linux/mempolicy.h | 16 ++++++++++
mm/mempolicy.c | 72 +++++++++++++++++++++++++++++++---------------
2 files changed, 66 insertions(+), 22 deletions(-)
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2007-08-02 15:42:18.000000000 -0400
+++ Linux/mm/mempolicy.c 2007-08-02 15:42:20.000000000 -0400
@@ -123,12 +123,15 @@ static int mpol_check_policy(int mode, n
return -EINVAL;
break;
case MPOL_BIND:
- case MPOL_INTERLEAVE:
/* Preferred will only use the first bit, but allow
more for now. */
if (empty)
return -EINVAL;
break;
+ case MPOL_INTERLEAVE:
+ if (empty)
+ return 0; /* context dependent interleave */
+ break;
}
return nodes_subset(*nodes, node_online_map) ? 0 : -EINVAL;
}
@@ -187,6 +190,10 @@ static struct mempolicy *mpol_new(int mo
switch (mode) {
case MPOL_INTERLEAVE:
policy->v.nodes = *nodes;
+ if (nodes_weight(*nodes) == 0) {
+ mode |= MPOL_CONTEXT;
+ break;
+ }
nodes_and(policy->v.nodes, policy->v.nodes,
node_states[N_INTERLEAVE]);
if (nodes_weight(policy->v.nodes) == 0) {
@@ -462,6 +469,19 @@ static void mpol_set_task_struct_flag(vo
mpol_fix_fork_child_flag(current);
}
+/*
+ * Return node mask of specified [possibly contextualized] interleave policy.
+ */
+static nodemask_t *get_interleave_nodes(struct mempolicy *p)
+{
+ VM_BUG_ON(policy_mode(p) != MPOL_INTERLEAVE);
+
+ if (unlikely(p->policy & MPOL_CONTEXT)) {
+ return &cpuset_current_mems_allowed;
+ }
+ return &p->v.nodes;
+}
+
/* Set the process memory policy */
static long do_set_mempolicy(int mode, nodemask_t *nodes)
{
@@ -475,8 +495,8 @@ static long do_set_mempolicy(int mode, n
mpol_free(current->mempolicy);
current->mempolicy = new;
mpol_set_task_struct_flag();
- if (new && new->policy == MPOL_INTERLEAVE)
- current->il_next = first_node(new->v.nodes);
+ if (new && policy_mode(new) == MPOL_INTERLEAVE)
+ current->il_next = first_node(*get_interleave_nodes(new));
return 0;
}
@@ -488,7 +508,7 @@ static void get_nodemask(struct mempolic
int i;
nodes_clear(*nodes);
- switch (p->policy) {
+ switch (policy_mode(p)) {
case MPOL_BIND:
for (i = 0; p->v.zonelist->zones[i]; i++)
node_set(zone_to_nid(p->v.zonelist->zones[i]),
@@ -497,7 +517,7 @@ static void get_nodemask(struct mempolic
case MPOL_DEFAULT:
break;
case MPOL_INTERLEAVE:
- *nodes = p->v.nodes;
+ *nodes = *get_interleave_nodes(p);
break;
case MPOL_PREFERRED:
/*
@@ -562,7 +582,7 @@ static long do_get_mempolicy(int *policy
goto out;
*policy = err;
} else if (pol == current->mempolicy &&
- pol->policy == MPOL_INTERLEAVE) {
+ policy_mode(pol) == MPOL_INTERLEAVE) {
*policy = current->il_next;
} else {
err = -EINVAL;
@@ -1105,7 +1125,7 @@ static struct zonelist *zonelist_policy(
{
int nd;
- switch (policy->policy) {
+ switch (policy_mode(policy)) {
case MPOL_PREFERRED:
nd = policy->v.preferred_node;
if (nd < 0)
@@ -1133,13 +1153,13 @@ static struct zonelist *zonelist_policy(
static unsigned interleave_nodes(struct mempolicy *policy)
{
unsigned nid, next;
- struct task_struct *me = current;
+ nodemask_t *nodes = get_interleave_nodes(policy);
- nid = me->il_next;
- next = next_node(nid, policy->v.nodes);
+ nid = current->il_next;
+ next = next_node(nid, *nodes);
if (next >= MAX_NUMNODES)
- next = first_node(policy->v.nodes);
- me->il_next = next;
+ next = first_node(*nodes);
+ current->il_next = next;
return nid;
}
@@ -1149,7 +1169,7 @@ static unsigned interleave_nodes(struct
*/
unsigned slab_node(struct mempolicy *policy)
{
- int pol = policy ? policy->policy : MPOL_DEFAULT;
+ int pol = policy ? policy_mode(policy) : MPOL_DEFAULT;
switch (pol) {
case MPOL_INTERLEAVE:
@@ -1176,14 +1196,15 @@ unsigned slab_node(struct mempolicy *pol
static unsigned offset_il_node(struct mempolicy *pol,
struct vm_area_struct *vma, unsigned long off)
{
- unsigned nnodes = nodes_weight(pol->v.nodes);
+ nodemask_t *nodes = get_interleave_nodes(pol);
+ unsigned nnodes = nodes_weight(*nodes);
unsigned target = (unsigned)off % nnodes;
int c;
int nid = -1;
c = 0;
do {
- nid = next_node(nid, pol->v.nodes);
+ nid = next_node(nid, *nodes);
c++;
} while (c <= target);
return nid;
@@ -1218,7 +1239,7 @@ struct zonelist *huge_zonelist(struct vm
{
struct mempolicy *pol = get_vma_policy(current, vma, addr);
- if (pol->policy == MPOL_INTERLEAVE) {
+ if (policy_mode(pol) == MPOL_INTERLEAVE) {
unsigned nid;
nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
@@ -1272,7 +1293,7 @@ alloc_page_vma(gfp_t gfp, struct vm_area
cpuset_update_task_memory_state();
- if (unlikely(pol->policy == MPOL_INTERLEAVE)) {
+ if (unlikely(policy_mode(pol) == MPOL_INTERLEAVE)) {
unsigned nid;
nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
@@ -1308,7 +1329,7 @@ struct page *alloc_pages_current(gfp_t g
cpuset_update_task_memory_state();
if (!pol || in_interrupt() || (gfp & __GFP_THISNODE))
pol = &default_policy;
- if (pol->policy == MPOL_INTERLEAVE)
+ if (policy_mode(pol) == MPOL_INTERLEAVE)
return alloc_page_interleave(gfp, order, interleave_nodes(pol));
return __alloc_pages(gfp, order, zonelist_policy(gfp, pol));
}
@@ -1353,11 +1374,12 @@ int __mpol_equal(struct mempolicy *a, st
return 0;
if (a->policy != b->policy)
return 0;
- switch (a->policy) {
+ switch (policy_mode(a)) {
case MPOL_DEFAULT:
return 1;
case MPOL_INTERLEAVE:
- return nodes_equal(a->v.nodes, b->v.nodes);
+ return a->policy & MPOL_CONTEXT ||
+ nodes_equal(a->v.nodes, b->v.nodes);
case MPOL_PREFERRED:
return a->v.preferred_node == b->v.preferred_node;
case MPOL_BIND: {
@@ -1679,6 +1701,11 @@ static void mpol_rebind_policy(struct me
current->il_next = node_remap(current->il_next,
*mpolmask, *newmask);
break;
+ case MPOL_INTERLEAVE|MPOL_CONTEXT:
+ /*
+ * No remap necessary for contextual interleave
+ */
+ break;
case MPOL_PREFERRED:
/*
* no need to remap "local policy"
@@ -1765,7 +1792,7 @@ static inline int mpol_to_str(char *buff
char *p = buffer;
int nid, l;
nodemask_t nodes;
- int mode = pol ? pol->policy : MPOL_DEFAULT;
+ int mode = pol ? policy_mode(pol) : MPOL_DEFAULT;
switch (mode) {
case MPOL_DEFAULT:
@@ -1790,7 +1817,8 @@ static inline int mpol_to_str(char *buff
break;
case MPOL_INTERLEAVE:
- nodes = pol->v.nodes;
+ nodes = *get_interleave_nodes(pol);
+ // TODO: or show indication of context-dependent interleave?
break;
default:
Index: Linux/include/linux/mempolicy.h
===================================================================
--- Linux.orig/include/linux/mempolicy.h 2007-08-02 15:42:18.000000000 -0400
+++ Linux/include/linux/mempolicy.h 2007-08-02 15:42:20.000000000 -0400
@@ -15,6 +15,13 @@
#define MPOL_INTERLEAVE 3
#define MPOL_MAX MPOL_INTERLEAVE
+#define MPOL_MODE 0x0ff /* reserve 8 bits for policy "mode" */
+
+/*
+ * OR'd into struct mempolicy 'policy' member for "context-dependent interleave" --
+ * i.e., interleave across all nodes allowed in current context.
+ */
+#define MPOL_CONTEXT (1 << 8)
/* Flags for get_mem_policy */
#define MPOL_F_NODE (1<<0) /* return next IL mode instead of node mask */
@@ -72,6 +79,15 @@ struct mempolicy {
};
/*
+ * Return 'policy' [a.k.a. 'mode'] member of mpol, less CONTEXT
+ * or any other modifiers.
+ */
+static inline int policy_mode(struct mempolicy *mpol)
+{
+ return mpol->policy & MPOL_MODE;
+}
+
+/*
* Support for managing mempolicy data objects (clone, copy, destroy)
* The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
*/
* Audit of "all uses of node_online()"
2007-08-01 17:41 ` Christoph Lameter
2007-08-01 17:54 ` Lee Schermerhorn
2007-08-02 20:05 ` [PATCH/RFC/WIP] cpuset-independent interleave policy Lee Schermerhorn
@ 2007-08-02 20:19 ` Lee Schermerhorn
2007-08-02 20:26 ` Christoph Lameter
2007-08-02 20:33 ` Audit of "all uses of node_online()" Andrew Morton
2 siblings, 2 replies; 68+ messages in thread
From: Lee Schermerhorn @ 2007-08-02 20:19 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrew Morton, linux-mm, ak, Nishanth Aravamudan, pj, kxr,
Mel Gorman, KAMEZAWA Hiroyuki
Was Re: [PATCH 01/14] NUMA: Generic management of nodemasks for various
purposes
On Wed, 2007-08-01 at 10:41 -0700, Christoph Lameter wrote:
> On Wed, 1 Aug 2007, Lee Schermerhorn wrote:
>
> > I think Andrew is referring to the "exclude selected nodes from
> > interleave policy" and "preferred policy fixups" patches. Those are
> > related to the memoryless node patches in the sense that they touch some
> > of the same lines in mempolicy.c. However, IMO, those patches shouldn't
> > gate the memoryless node series once the i386 issues are resolved.
>
> Right. I think we first need to get the basic set straight. In order to be
> complete we need to audit all uses of node_online() in the kernel and
> think about those uses. They may require either N_NORMAL_MEMORY or
> N_HIGH_MEMORY depending on the check being for a page cache or a kernel
> allocation.
Below is a list of files in 23-rc1-mm2 with the memoryless nodes patches
applied [the last ones I posted, not the most recent from Christoph's
tree] that contain the strings 'node_online' or 'online_node'--i.e.
possible uses of the node_online_map or the for_each_online_node macro.
48 files in all, I think.
I have started looking at all of these and I'm preparing a patch to
"fix" the ones that look obviously wrong to me. Not very far along,
yet, and I won't finish it today. I won't be in on Friday [or the
weekend :-)], but will continue next week.
Note that the list includes a lot of architectural dependent files.
Shall I do a separate patch for each arch, so that arch maintainer can
focus on that [I assume they'll want to review], or a single "jumbo
patch" to reduce traffic?
Lee
------------
arch/alpha/mm/numa.c
arch/arm/mm/init.c
arch/avr32/kernel/setup.c
arch/avr32/mm/init.c
arch/i386/kernel/numaq.c
arch/i386/kernel/setup.c
arch/i386/kernel/srat.c
arch/i386/kernel/topology.c
arch/i386/mm/discontig.c
arch/i386/pci/numa.c
arch/ia64/kernel/acpi.c
arch/ia64/kernel/topology.c
arch/ia64/mm/discontig.c
arch/ia64/sn/kernel/setup.c
arch/ia64/sn/kernel/sn2/prominfo_proc.c
arch/ia64/sn/kernel/sn2/sn_hwperf.c
arch/ia64/sn/kernel/xpc_partition.c
arch/m32r/kernel/setup.c
arch/m32r/mm/discontig.c
arch/m32r/mm/init.c
arch/mips/kernel/topology.c
arch/mips/sgi-ip27/ip27-klnuma.c
arch/mips/sgi-ip27/ip27-memory.c
arch/mips/sgi-ip27/ip27-nmi.c
arch/mips/sgi-ip27/ip27-reset.c
arch/mips/sgi-ip27/ip27-smp.c
arch/parisc/mm/init.c
arch/powerpc/mm/mem.c
arch/powerpc/mm/numa.c
arch/powerpc/platforms/cell/iommu.c
arch/powerpc/platforms/cell/spufs/sched.c
arch/sh/kernel/setup.c
arch/sh/kernel/topology.c
arch/sh/mm/init.c
arch/x86_64/kernel/pci-dma.c
arch/x86_64/kernel/setup.c
arch/x86_64/mm/numa.c
drivers/base/node.c
drivers/char/mmtimer.c
include/linux/nodemask.h
include/linux/topology.h
mm/mempolicy.c
? should BIND nodes be limited to nodes with memory?
? ALL policies in mpol_new()?
? should mpol_check_policy() require a subset of nodes with memory?
mm/shmem.c
fixed mount option parsing and superblock setup.
mm/page-writeback.c
fixed highmem_dirtyable_memory() to just look at N_MEMORY
mm/page_alloc.c
mm/slab.c
mm/swap_prefetch.c
fixed clear_{last|current}_prefetch_free()
net/sunrpc/svc.c
fixed svc_pool_map_choose_mode()
* Re: Audit of "all uses of node_online()"
2007-08-02 20:19 ` Audit of "all uses of node_online()" Lee Schermerhorn
@ 2007-08-02 20:26 ` Christoph Lameter
2007-08-08 22:19 ` Lee Schermerhorn
2007-08-02 20:33 ` Audit of "all uses of node_online()" Andrew Morton
1 sibling, 1 reply; 68+ messages in thread
From: Christoph Lameter @ 2007-08-02 20:26 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: Andrew Morton, linux-mm, ak, Nishanth Aravamudan, pj, kxr,
Mel Gorman, KAMEZAWA Hiroyuki
On Thu, 2 Aug 2007, Lee Schermerhorn wrote:
> > Right. I think we first need to get the basic set straight. In order to be
> > complete we need to audit all uses of node_online() in the kernel and
> > think about those uses. They may require either N_NORMAL_MEMORY or
> > N_HIGH_MEMORY depending on the check being for a page cache or a kernel
> > allocation.
>
> Below is a list of files in 23-rc1-mm2 with the memoryless nodes patches
> applied [the last ones I posted, not the most recent from Christoph's
> tree] that contain the strings 'node_online' or 'online_node'--i.e.
> possible uses of the node_online_map or the for_each_online_node macro.
> 48 files in all, I think.
Great thanks.
> Note that the list includes a lot of architectural dependent files.
> Shall I do a separate patch for each arch, so that arch maintainer can
> focus on that [I assume they'll want to review], or a single "jumbo
> patch" to reduce traffic?
Separate arch patches would be good.
> include/linux/topology.h
> mm/mempolicy.c
> ? should BIND nodes be limited to nodes with memory?
Or it could automatically limit to those by anding with N_HIGH_MEMORY?
> ? ALL policies in mpol_new()?
> ? should mpol_check_policy() require a subset of nodes with memory?
Yeah, difficult question. What would the impact be if we require that? A node
going down could cause the application to fail?
> mm/shmem.c
> fixed mount option parsing and superblock setup.
> mm/page-writeback.c
> fixed highmem_dirtyable_memory() to just look at N_MEMORY
N_HIGH_MEMORY right?
* Re: Audit of "all uses of node_online()"
2007-08-02 20:19 ` Audit of "all uses of node_online()" Lee Schermerhorn
2007-08-02 20:26 ` Christoph Lameter
@ 2007-08-02 20:33 ` Andrew Morton
2007-08-02 20:45 ` Lee Schermerhorn
1 sibling, 1 reply; 68+ messages in thread
From: Andrew Morton @ 2007-08-02 20:33 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: Christoph Lameter, linux-mm, ak, Nishanth Aravamudan, pj, kxr,
Mel Gorman, KAMEZAWA Hiroyuki
On Thu, 02 Aug 2007 16:19:53 -0400
Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> Note that the list includes a lot of architectural dependent files.
> Shall I do a separate patch for each arch, so that arch maintainer can
> focus on that [I assume they'll want to review], or a single "jumbo
> patch" to reduce traffic?
Separate patches please, if they are independent.
Even if there are dependencies, a base patch plus a string of
arch patches would be a nice presentation.
* Re: [PATCH/RFC/WIP] cpuset-independent interleave policy
2007-08-02 20:05 ` [PATCH/RFC/WIP] cpuset-independent interleave policy Lee Schermerhorn
@ 2007-08-02 20:34 ` Christoph Lameter
2007-08-02 21:04 ` Lee Schermerhorn
0 siblings, 1 reply; 68+ messages in thread
From: Christoph Lameter @ 2007-08-02 20:34 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, ak, Nishanth Aravamudan, pj, kxr, Mel Gorman,
KAMEZAWA Hiroyuki, Andrew Morton, Eric Whitney
On Thu, 2 Aug 2007, Lee Schermerhorn wrote:
> This patch introduces a cpuset-independent interleave policy that will
> work in shared policies applied to shared memory segments attached by
> tasks in disjoint cpusets. The cpuset-independent policy effectively
> says "interleave across all valid nodes in the context where page
> allocation occurs."
In order to make this work across policies you also need to have context
independent MPOL_BIND, right?
AFAICT we would need something like relative node numbers to make this
work across all policy types?
Maybe treat the nodemask as a nodemask relative to the nodes of the cpuset
(or other constraint) if a certain flag is set? Nodes that go beyond the
end of the allowed nodes in a certain context wrap around to the first
again?
E.g. if you have a cpuset with nodes
2 5 7
Then a relative nodemask [0] would refer to node 2. [1] to node 5 and [2]
to node 7. [0-2] would be referring to all. [0-7] would map to multiple
nodes.
So you could specify a relative interleave policy on [0-MAX_NUMNODES] and
it would disperse it evenly across the allowed nodes regardless of the
cpuset that the policy is being used in?
If we had this then we may be able to avoid translating memory policies
while migrating processes from cpuset to cpuset. Paul and I talked about
this a couple of times in the past.
Doing so would fix one of the issues with "memory based" object policies.
However, there will still be the case where the policy desired for one
memory area may be node local or interleave, depending on the cpuset.
* Re: Audit of "all uses of node_online()"
2007-08-02 20:33 ` Audit of "all uses of node_online()" Andrew Morton
@ 2007-08-02 20:45 ` Lee Schermerhorn
0 siblings, 0 replies; 68+ messages in thread
From: Lee Schermerhorn @ 2007-08-02 20:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Lameter, linux-mm, ak, Nishanth Aravamudan, pj, kxr,
Mel Gorman, KAMEZAWA Hiroyuki
On Thu, 2007-08-02 at 13:33 -0700, Andrew Morton wrote:
> On Thu, 02 Aug 2007 16:19:53 -0400
> Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
>
> > Note that the list includes a lot of architectural dependent files.
> > Shall I do a separate patch for each arch, so that arch maintainer can
> > focus on that [I assume they'll want to review], or a single "jumbo
> > patch" to reduce traffic?
>
> Separate patches please, if they are independent.
>
> Even if they are dependencies, a base patch plus a string of
> arch patches would be a nice presentation.
>
Will do. As I get to them.
I'll repost the file list with annotations as well. I've already seen
that some files are probably OK as is.
Lee
* Re: [PATCH/RFC/WIP] cpuset-independent interleave policy
2007-08-02 20:34 ` Christoph Lameter
@ 2007-08-02 21:04 ` Lee Schermerhorn
2007-08-03 0:31 ` Christoph Lameter
0 siblings, 1 reply; 68+ messages in thread
From: Lee Schermerhorn @ 2007-08-02 21:04 UTC (permalink / raw)
To: Christoph Lameter
Cc: linux-mm, ak, Nishanth Aravamudan, pj, kxr, Mel Gorman,
KAMEZAWA Hiroyuki, Andrew Morton, Eric Whitney
On Thu, 2007-08-02 at 13:34 -0700, Christoph Lameter wrote:
> On Thu, 2 Aug 2007, Lee Schermerhorn wrote:
>
> > This patch introduces a cpuset-independent interleave policy that will
> > work in shared policies applied to shared memory segments attached by
> > tasks in disjoint cpusets. The cpuset-independent policy effectively
> > says "interleave across all valid nodes in the context where page
> > allocation occurs."
>
> In order to make this work across policies you also need to have context
> independent MPOL_BIND, right?
Yeah. That one's trickier, I think...
>
> AFAICT we would need something like relative node numbers to make this
> work across all policy types?
>
> Maybe treat the nodemask as a nodemask relative to the nodes of the cpuset
> (or other constraint) if a certain flag is set? Nodes that go beyond the
> end of the allowed nodes in a certain context wrap around to the first
> again?
One could expose the "MPOL_CONTEXT" flag via the API, but then a task
might have a mix of policy types. Maybe a per cpuset control to enable
relative node ids? [see below re: translating policies...]
>
>
> E.g. if you have a cpuset with nodes
>
> 2 5 7
>
> Then a relative nodemask [0] would refer to node 2. [1] to node 5 and [2]
> to node 7. [0-2] would be referring to all. [0-7] would map to multiple
> nodes.
>
> So you could specify a relative interleave policy on [0-MAX_NUMNODES] and
> it would disperse it evenly across the allowed nodes regardless of the
> cpuset that the policy is being used in?
Yeah, but if the # nodes in the node mask aren't a multiple of the # of
memory nodes in the cpuset, you might get more pages on one or more
nodes.
>
> If we had this then we may be able to avoid translating memory policies
> while migrating processes from cpuset to cpuset. Paul and I talked about
> this a couple of times in the past.
I recall that the initial "CpuMemSets" proposal had something like this.
I haven't thought about cpuset relative node ids enough to start to
understand the performance implications of doing the translation.
You might still want to do the translation, but only in the
current->mems_allowed mask. If we had a per cpuset control [all
policies have absolute or relative node ids], you wouldn't have to look
at the task policy and all of the vma policies in the relative node id
case, since basically, all node masks would be valid.
I'll continue to investigate this, as time permits. And, maybe we'll
hear from Paul when he gets back from vacation.
>
> Doing so would fix one of the issues with "memory based" object policies.
> However, there will still be the case where the policy desired for one
> memory area be node local and or interleave depending on the cpuset.
Yeah, still got a ways to go, huh? Anyway, I wanted to start folks
thinking about it.
Lee
* Re: [PATCH/RFC/WIP] cpuset-independent interleave policy
2007-08-02 21:04 ` Lee Schermerhorn
@ 2007-08-03 0:31 ` Christoph Lameter
0 siblings, 0 replies; 68+ messages in thread
From: Christoph Lameter @ 2007-08-03 0:31 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, ak, Nishanth Aravamudan, pj, kxr, Mel Gorman,
KAMEZAWA Hiroyuki, Andrew Morton, Eric Whitney
On Thu, 2 Aug 2007, Lee Schermerhorn wrote:
> > AFAICT we would need something like relative node numbers to make this
> > work across all policy types?
> >
> > Maybe treat the nodemask as a nodemask relative to the nodes of the cpuset
> > (or other constraint) if a certain flag is set? Nodes that go beyond the
> > end of the allowed nodes in a certain context wrap around to the first
> > again?
>
> One could expose the "MPOL_CONTEXT" flag via the API, but then a task
> might have a mix of policy types. Maybe a per cpuset control to enable
> relative node ids? [see below re: translating policies...]
Maybe generally only use relative nodemasks in a cpuset?
> > to node 7. [0-2] would be referring to all. [0-7] would map to multiple
> > nodes.
> >
> > So you could specify a relative interleave policy on [0-MAX_NUMNODES] and
> > it would disperse it evenly across the allowed nodes regardless of the
> > cpuset that the policy is being used in?
>
> Yeah, but if the # nodes in the node mask aren't a multiple of the # of
> memory nodes in the cpuset, you might get more pages on one or more
> nodes.
Ok so we may have to modify interleave to stop on the last relative node
that has memory and then start over?
> You might still want to do the translation, but only in the
> current->mems_allowed mask. If we had a per cpuset control [all
> policies have absolute or relative node ids], you wouldn't have to look
> at the task policy and all of the vma policies in the relative node id
> case, since basically, all node masks would be valid
Well, maybe simply say all policies in a cpuset use relative numbering,
period?
> > Doing so would fix one of the issues with "memory based" object policies.
> > However, there will still be the case where the policy desired for one
> > memory area may be node local or interleave, depending on the cpuset.
>
> Yeah, still got a ways to go, huh? Anyway, I wanted to start folks
> thinking about it.
Relative node numbers are a great feature regardless. It would allow one
to write scripts that can run in any cpuset or write applications that can
set memory policies without worrying too much about where the nodes are
located.
* Re: Audit of "all uses of node_online()"
2007-08-02 20:26 ` Christoph Lameter
@ 2007-08-08 22:19 ` Lee Schermerhorn
2007-08-08 23:40 ` Christoph Lameter
0 siblings, 1 reply; 68+ messages in thread
From: Lee Schermerhorn @ 2007-08-08 22:19 UTC (permalink / raw)
To: Christoph Lameter, ak
Cc: Andrew Morton, linux-mm, Nishanth Aravamudan, pj, kxr,
Mel Gorman, KAMEZAWA Hiroyuki
On Thu, 2007-08-02 at 13:26 -0700, Christoph Lameter wrote:
> On Thu, 2 Aug 2007, Lee Schermerhorn wrote:
Just getting back to this...
Andi: I added you to the To: list, as I have specific questions for you
below, regarding mbind() and cpuset constraints.
>
> > Note that the list includes a lot of architectural dependent files.
> > Shall I do a separate patch for each arch, so that arch maintainer can
> > focus on that [I assume they'll want to review], or a single "jumbo
> > patch" to reduce traffic?
>
> Separate arch patches would be good.
Yeah, that's what Andrew said. I'm trying to wrap up the "generic"
patch now...
>
> > include/linux/topology.h
> > mm/mempolicy.c
> > ? should BIND nodes be limited to nodes with memory?
>
> Or it could automatically limit to those by anding with N_HIGH_MEMORY?
That's what I meant. The "ALL policies..." below is an extension of
this thought.
>
> > ? ALL policies in mpol_new()?
> > ? should mpol_check_policy() require a subset of nodes with memory?
>
> Yeah, difficult question. What would the impact be if we required that? A node
> going down could cause the application to fail?
OK.
First, note that mpol_check_policy() is always called just before
mpol_new() [except in the case of shared policy init, which is covered
by the fix mentioned in a previous mail re: parsing mount options].
Now, looking at this more, I think mpol_check_policy() could [should?]
ensure that the argument nodemask is non-empty after ANDing with the
N_HIGH_MEMORY mask--i.e., that it contains at least one node with memory.
However, it should first check for and allow the special case[s] of an
empty nodemask: with MPOL_PREFERRED, an empty mask means "local"
allocation; and, if my "cpuset-independent" interleave policy is
accepted, an empty mask for interleave means all allowed nodes in the
cpuset where the allocation occurs.
If mpol_check_policy() did that, we could just mask off nodes w/o memory
in mpol_new() knowing that we'd end up with at least one populated node.
The result of this change would be that we would now silently mask off
invalid nodes--i.e., nodes w/o memory, NOT nodes disallowed by
cpuset--instead of giving an error. Note that this is the effect, for
interleave policy, of the memoryless node patch to fix interleave
behavior.
As for the effect of a node "going down": currently,
mpol_check_policy() checks against online nodes. If a node goes down,
it's no longer online, right? So that check would fail. I don't think
changing it to nodes with memory would change the user-visible behavior.
Andi:
Somewhat related: in looking at these, I see that set_mempolicy() calls
contextualize_policy(), which first ensures that the nodemask is a subset
of the current task's mems_allowed, returning EINVAL if not. If the
mask IS a valid subset, it calls mpol_check_policy() for additional
sanity checks, as discussed above.
However, do_mbind() calls mpol_check_policy() directly. Thus, it
doesn't seem to enforce the cpuset constraints for vma and shared
policies. Is this intentional--e.g., so that shmem policies can specify
any node in the system? I think the cpuset constraint will be applied
later, during page allocation, right?
>
> > mm/shmem.c
> > fixed mount option parsing and superblock setup.
I see that we do inline validation of any policy specified in the mount
options. Should we use a common mpol_check() function? Or is that too
application-specific?
> > mm/page-writeback.c
> > fixed highmem_dirtyable_memory() to just look at N_MEMORY
>
> N_HIGH_MEMORY right?
Yeah. I hadn't upgraded to your latest patch set when I started this.
Lee
* Re: Audit of "all uses of node_online()"
2007-08-08 22:19 ` Lee Schermerhorn
@ 2007-08-08 23:40 ` Christoph Lameter
2007-08-16 14:17 ` [PATCH/RFC] memoryless nodes - fixup uses of node_online_map in generic code Lee Schermerhorn
` (2 more replies)
0 siblings, 3 replies; 68+ messages in thread
From: Christoph Lameter @ 2007-08-08 23:40 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: ak, Andrew Morton, linux-mm, Nishanth Aravamudan, pj, kxr,
Mel Gorman, KAMEZAWA Hiroyuki
On Wed, 8 Aug 2007, Lee Schermerhorn wrote:
> First note that mpol_check_policy() is always called just before
> mpol_new() [except in the case of share policy init which is covered by
> the fix mentioned below in previous mail re: parsing mount options].
> Now, looking at this more, I think mpol_check_policy() could [should?]
> ensure that the argument nodemask is non-null after ANDing with the
> N_HIGH_MEMORY mask--i.e., contains at least one node with memory.
Hmmm... I thought about this yesterday, and it occurred to me that maybe
the nodemask needs to allow all possible nodes. What if the nodemask is
going to be used to select a node for a device? Or a cpu on a certain set
of nodes? If we restrict it to the set of valid memory nodes, then the
policy can only be used to select memory nodes.
* [PATCH/RFC] memoryless nodes - fixup uses of node_online_map in generic code
2007-08-08 23:40 ` Christoph Lameter
@ 2007-08-16 14:17 ` Lee Schermerhorn
2007-08-16 18:33 ` Christoph Lameter
2007-08-16 21:10 ` Lee Schermerhorn
2007-08-24 16:09 ` [PATCH] 2.6.23-rc3-mm1 - Move setup of N_CPU node state mask Lee Schermerhorn
2 siblings, 1 reply; 68+ messages in thread
From: Lee Schermerhorn @ 2007-08-16 14:17 UTC (permalink / raw)
To: Christoph Lameter, Andrew Morton
Cc: ak, linux-mm, Nishanth Aravamudan, pj, kxr, Mel Gorman,
KAMEZAWA Hiroyuki, Eric Whitney
Christoph, Andrew:
Here's a cut at fixing up uses of the online node map in generic code.
I'll look at the archs as I get the time, but I thought it worth getting
the ball rolling on the generic fixes.
Note questions about use of N_HIGH_MEMORY in find_next_best_node() and
population of N_HIGH_MEMORY in early_calculate_totalpages().
Comments?
Lee
-----------------
PATCH/RFC Fix generic usage of node_online_map
Against 2.6.23-rc2-mm2
mm/shmem.c:shmem_parse_mpol()
Ensure nodelist is subset of nodes with memory.
Use node_states[N_HIGH_MEMORY] as default for missing
nodelist for interleave policy.
mm/shmem.c:shmem_fill_super()
initialize policy_nodes to node_states[N_HIGH_MEMORY]
mm/page-writeback.c:highmem_dirtyable_memory()
sum over nodes with memory
mm/swap_prefetch.c:clear_last_prefetch_free()
clear_last_current_free()
use nodes with memory for prefetch nodes.
just in case ...
mm/page_alloc.c:zlc_setup()
allowednodes - use nodes with memory.
mm/page_alloc.c:default_zonelist_order()
average over nodes with memory.
mm/page_alloc.c:find_next_best_node()
skip nodes w/o memory.
N_HIGH_MEMORY state mask may not be initialized at this time,
unless we want to depend on early_calculate_totalpages() [see
below]. Will ZONE_MOVABLE ever be configurable?
mm/page_alloc.c:find_zone_movable_pfns_for_nodes()
spread kernelcore over nodes with memory.
This required calling early_calculate_totalpages()
unconditionally, and populating N_HIGH_MEMORY node
state therein from nodes in the early_node_map[].
If we can depend on this, we can eliminate the
population of N_HIGH_MEMORY mask from __build_all_zonelists()
and use the N_HIGH_MEMORY mask in find_next_best_node().
mm/mempolicy.c:mpol_check_policy()
Ensure nodes specified for policy are subset of
nodes with memory.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
mm/mempolicy.c | 2 +-
mm/page-writeback.c | 2 +-
mm/page_alloc.c | 36 ++++++++++++++++++++++++++++--------
mm/shmem.c | 13 ++++++++-----
mm/swap_prefetch.c | 4 ++--
5 files changed, 40 insertions(+), 17 deletions(-)
Index: Linux/mm/shmem.c
===================================================================
--- Linux.orig/mm/shmem.c 2007-08-15 10:01:22.000000000 -0400
+++ Linux/mm/shmem.c 2007-08-16 09:48:45.000000000 -0400
@@ -971,7 +971,7 @@ static inline int shmem_parse_mpol(char
*nodelist++ = '\0';
if (nodelist_parse(nodelist, *policy_nodes))
goto out;
- if (!nodes_subset(*policy_nodes, node_online_map))
+ if (!nodes_subset(*policy_nodes, node_states[N_HIGH_MEMORY]))
goto out;
}
if (!strcmp(value, "default")) {
@@ -996,9 +996,11 @@ static inline int shmem_parse_mpol(char
err = 0;
} else if (!strcmp(value, "interleave")) {
*policy = MPOL_INTERLEAVE;
- /* Default to nodes online if no nodelist */
+ /*
+ * Default to online nodes with memory if no nodelist
+ */
if (!nodelist)
- *policy_nodes = node_online_map;
+ *policy_nodes = node_states[N_HIGH_MEMORY];
err = 0;
}
out:
@@ -1060,7 +1062,8 @@ shmem_alloc_page(gfp_t gfp, struct shmem
return page;
}
#else
-static inline int shmem_parse_mpol(char *value, int *policy, nodemask_t *policy_nodes)
+static inline int shmem_parse_mpol(char *value, int *policy,
+ nodemask_t *policy_nodes)
{
return 1;
}
@@ -2239,7 +2242,7 @@ static int shmem_fill_super(struct super
unsigned long blocks = 0;
unsigned long inodes = 0;
int policy = MPOL_DEFAULT;
- nodemask_t policy_nodes = node_online_map;
+ nodemask_t policy_nodes = node_states[N_HIGH_MEMORY];
#ifdef CONFIG_TMPFS
/*
Index: Linux/mm/page-writeback.c
===================================================================
--- Linux.orig/mm/page-writeback.c 2007-08-15 10:01:22.000000000 -0400
+++ Linux/mm/page-writeback.c 2007-08-15 10:13:49.000000000 -0400
@@ -126,7 +126,7 @@ static unsigned long highmem_dirtyable_m
int node;
unsigned long x = 0;
- for_each_online_node(node) {
+ for_each_node_state(node, N_HIGH_MEMORY) {
struct zone *z =
&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
Index: Linux/mm/swap_prefetch.c
===================================================================
--- Linux.orig/mm/swap_prefetch.c 2007-08-15 10:01:22.000000000 -0400
+++ Linux/mm/swap_prefetch.c 2007-08-15 10:13:49.000000000 -0400
@@ -249,7 +249,7 @@ static void clear_last_prefetch_free(voi
* Reset the nodes suitable for prefetching to all nodes. We could
* update the data to take into account memory hotplug if desired..
*/
- sp_stat.prefetch_nodes = node_online_map;
+ sp_stat.prefetch_nodes = node_states[N_HIGH_MEMORY];
for_each_node_mask(node, sp_stat.prefetch_nodes) {
struct node_stats *ns = &sp_stat.node[node];
@@ -261,7 +261,7 @@ static void clear_current_prefetch_free(
{
int node;
- sp_stat.prefetch_nodes = node_online_map;
+ sp_stat.prefetch_nodes = node_states[N_HIGH_MEMORY];
for_each_node_mask(node, sp_stat.prefetch_nodes) {
struct node_stats *ns = &sp_stat.node[node];
Index: Linux/mm/page_alloc.c
===================================================================
--- Linux.orig/mm/page_alloc.c 2007-08-15 10:05:41.000000000 -0400
+++ Linux/mm/page_alloc.c 2007-08-16 09:45:47.000000000 -0400
@@ -1302,7 +1302,7 @@ int zone_watermark_ok(struct zone *z, in
*
* If the zonelist cache is present in the passed in zonelist, then
* returns a pointer to the allowed node mask (either the current
- * tasks mems_allowed, or node_online_map.)
+ * tasks mems_allowed, or node_states[N_HIGH_MEMORY].)
*
* If the zonelist cache is not available for this zonelist, does
* nothing and returns NULL.
@@ -1331,7 +1331,7 @@ static nodemask_t *zlc_setup(struct zone
allowednodes = !in_interrupt() && (alloc_flags & ALLOC_CPUSET) ?
&cpuset_current_mems_allowed :
- &node_online_map;
+ &node_states[N_HIGH_MEMORY];
return allowednodes;
}
@@ -2128,8 +2128,17 @@ static int find_next_best_node(int node,
}
for_each_online_node(n) {
+ pg_data_t *pgdat = NODE_DATA(n);
cpumask_t tmp;
+ /*
+ * skip nodes w/o memory.
+ * Note: N_HIGH_MEMORY state not guaranteed to be
+ * populated yet.
+ */
+ if (!pgdat->node_present_pages)
+ continue;
+
/* Don't want a node to appear more than once */
if (node_isset(n, *used_node_mask))
continue;
@@ -2264,7 +2273,8 @@ static int default_zonelist_order(void)
* If there is a node whose DMA/DMA32 memory is very big area on
* local memory, NODE_ORDER may be suitable.
*/
- average_size = total_size / (num_online_nodes() + 1);
+ average_size = total_size /
+ (nodes_weight(node_states[N_HIGH_MEMORY]) + 1);
for_each_online_node(nid) {
low_kmem_size = 0;
total_size = 0;
@@ -3750,14 +3760,24 @@ unsigned long __init find_max_pfn_with_a
return max_pfn;
}
+/*
+ * early_calculate_totalpages()
+ * Sum pages in active regions for movable zone.
+ * Populate N_HIGH_MEMORY for calculating usable_nodes.
+ */
static unsigned long __init early_calculate_totalpages(void)
{
int i;
unsigned long totalpages = 0;
- for (i = 0; i < nr_nodemap_entries; i++)
- totalpages += early_node_map[i].end_pfn -
+ for (i = 0; i < nr_nodemap_entries; i++) {
+ unsigned long pages = early_node_map[i].end_pfn -
early_node_map[i].start_pfn;
+ totalpages += pages;
+ if (pages)
+ node_set_state(early_node_map[i].nid,
+ N_HIGH_MEMORY);
+ }
return totalpages;
}
@@ -3773,7 +3793,8 @@ void __init find_zone_movable_pfns_for_n
int i, nid;
unsigned long usable_startpfn;
unsigned long kernelcore_node, kernelcore_remaining;
- int usable_nodes = num_online_nodes();
+ unsigned long totalpages = early_calculate_totalpages();
+ int usable_nodes = nodes_weight(node_states[N_HIGH_MEMORY]);
/*
* If movablecore was specified, calculate what size of
@@ -3784,7 +3805,6 @@ void __init find_zone_movable_pfns_for_n
* what movablecore would have allowed.
*/
if (required_movablecore) {
- unsigned long totalpages = early_calculate_totalpages();
unsigned long corepages;
/*
@@ -3809,7 +3829,7 @@ void __init find_zone_movable_pfns_for_n
restart:
/* Spread kernelcore memory as evenly as possible throughout nodes */
kernelcore_node = required_kernelcore / usable_nodes;
- for_each_online_node(nid) {
+ for_each_node_state(nid, N_HIGH_MEMORY) {
/*
* Recalculate kernelcore_node if the division per node
* now exceeds what is necessary to satisfy the requested
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2007-08-15 10:01:22.000000000 -0400
+++ Linux/mm/mempolicy.c 2007-08-16 09:45:47.000000000 -0400
@@ -130,7 +130,7 @@ static int mpol_check_policy(int mode, n
return -EINVAL;
break;
}
- return nodes_subset(*nodes, node_online_map) ? 0 : -EINVAL;
+ return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
}
/* Generate a custom zonelist for the BIND policy. */
* Re: [PATCH/RFC] memoryless nodes - fixup uses of node_online_map in generic code
2007-08-16 14:17 ` [PATCH/RFC] memoryless nodes - fixup uses of node_online_map in generic code Lee Schermerhorn
@ 2007-08-16 18:33 ` Christoph Lameter
2007-08-16 19:15 ` Lee Schermerhorn
0 siblings, 1 reply; 68+ messages in thread
From: Christoph Lameter @ 2007-08-16 18:33 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: Andrew Morton, ak, linux-mm, Nishanth Aravamudan, pj, kxr,
Mel Gorman, KAMEZAWA Hiroyuki, Eric Whitney
On Thu, 16 Aug 2007, Lee Schermerhorn wrote:
> Note questions about use of N_HIGH_MEMORY in find_next_best_node() and
> population of N_HIGH_MEMORY in early_calculate_totalpages().
>
> Comments?
The changes in early_calculate_totalpages duplicate the setting of the bit
in the N_HIGH_MEMORY map. But that could be removed with an additional
patch if we are sure that early_calculate_totalpages is always called.
Otherwise it looks fine.
Acked-by: Christoph Lameter <clameter@sgi.com>
> mm/page_alloc.c:find_next_best_node()
>
> skip nodes w/o memory.
> N_HIGH_MEMORY state mask may not be initialized at this time,
> unless we want to depend on early_calculate_totalpages() [see
> below]. Will ZONE_MOVABLE ever be configurable?
Hopefully it will be removed at some point.
* Re: [PATCH/RFC] memoryless nodes - fixup uses of node_online_map in generic code
2007-08-16 18:33 ` Christoph Lameter
@ 2007-08-16 19:15 ` Lee Schermerhorn
0 siblings, 0 replies; 68+ messages in thread
From: Lee Schermerhorn @ 2007-08-16 19:15 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrew Morton, ak, linux-mm, Nishanth Aravamudan, pj, kxr,
Mel Gorman, KAMEZAWA Hiroyuki, Eric Whitney
On Thu, 2007-08-16 at 11:33 -0700, Christoph Lameter wrote:
> On Thu, 16 Aug 2007, Lee Schermerhorn wrote:
>
> > Note questions about use of N_HIGH_MEMORY in find_next_best_node() and
> > population of N_HIGH_MEMORY in early_calculate_totalpages().
> >
> > Comments?
>
> The changes in early_calculate_totalpages duplicate the setting of the bit
> in the N_HIGH_MEMORY map. But that could be removed with an additional
> patch if we are sure that early_calculate_totalpages is always called.
>
> Otherwise it looks fine.
>
> Acked-by: Christoph Lameter <clameter@sgi.com>
>
> > mm/page_alloc.c:find_next_best_node()
> >
> > skip nodes w/o memory.
> > N_HIGH_MEMORY state mask may not be initialized at this time,
> > unless we want to depend on early_calculate_totalpages() [see
> > below]. Will ZONE_MOVABLE ever be configurable?
>
> Hopefully it will be removed at some point.
That was my concern. I've heard that mentioned, so I didn't want to
depend on the early_calculate_totalpages(). It's only called from the
zone_movable setup, so I expect it will go away when zone movable goes.
Maybe we could move the populating of N_*_MEMORY to
free_area_init_nodes(). There's a loop over all online nodes there
that calls free_area_init_node(), from which calculate_node_totalpages()
is called. On return from free_area_init_node(), the node's
node_present_pages has been set. I'll work up and test an additional
patch.
Lee
* [PATCH/RFC] memoryless nodes - fixup uses of node_online_map in generic code
2007-08-08 23:40 ` Christoph Lameter
2007-08-16 14:17 ` [PATCH/RFC] memoryless nodes - fixup uses of node_online_map in generic code Lee Schermerhorn
@ 2007-08-16 21:10 ` Lee Schermerhorn
2007-08-16 21:13 ` Christoph Lameter
2007-08-24 16:09 ` [PATCH] 2.6.23-rc3-mm1 - Move setup of N_CPU node state mask Lee Schermerhorn
2 siblings, 1 reply; 68+ messages in thread
From: Lee Schermerhorn @ 2007-08-16 21:10 UTC (permalink / raw)
To: Christoph Lameter, Andrew Morton
Cc: ak, linux-mm, Nishanth Aravamudan, pj, kxr, Mel Gorman,
KAMEZAWA Hiroyuki, Eric Whitney
A slightly reworked version. See change log.
Tested: I printed the node masks after __build_all_zonelists(). They all
look OK, except that it appears process_zones() isn't getting called on my
platform, so the N_CPU mask is not being populated. Still investigating.
Lee
------------------------------
PATCH/RFC Fix generic usage of node_online_map - V2
Against 2.6.23-rc2-mm2
V1 -> V2:
+ moved population of N_HIGH_MEMORY node state mask to
free_area_init_node(), as this is called before we
build zonelists. So, we can use this mask in
find_next_best_node. Still need to keep the duplicate
code in early_calculate_totalpages() for zone movable
setup.
mm/shmem.c:shmem_parse_mpol()
Ensure nodelist is subset of nodes with memory.
Use node_states[N_HIGH_MEMORY] as default for missing
nodelist for interleave policy.
mm/shmem.c:shmem_fill_super()
initialize policy_nodes to node_states[N_HIGH_MEMORY]
mm/page-writeback.c:highmem_dirtyable_memory()
sum over nodes with memory
mm/swap_prefetch.c:clear_last_prefetch_free()
clear_last_current_free()
use nodes with memory for prefetch nodes.
just in case ...
mm/page_alloc.c:zlc_setup()
allowednodes - use nodes with memory.
mm/page_alloc.c:default_zonelist_order()
average over nodes with memory.
mm/page_alloc.c:find_next_best_node()
visit only nodes with memory [N_HIGH_MEMORY mask]
looking for next best node for fallback zonelists.
mm/page_alloc.c:find_zone_movable_pfns_for_nodes()
spread kernelcore over nodes with memory.
This required calling early_calculate_totalpages()
unconditionally, and populating N_HIGH_MEMORY node
state therein from nodes in the early_node_map[].
This duplicates the code in free_area_init_node(), but
I don't want to depend on this copy if ZONE_MOVABLE
might go away, taking early_calculate_totalpages()
with it.
mm/mempolicy.c:mpol_check_policy()
Ensure nodes specified for policy are subset of
nodes with memory.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
mm/mempolicy.c | 2 -
mm/page-writeback.c | 2 -
mm/page_alloc.c | 67 ++++++++++++++++++++++++++++++----------------------
mm/shmem.c | 13 ++++++----
mm/swap_prefetch.c | 4 +--
5 files changed, 51 insertions(+), 37 deletions(-)
Index: Linux/mm/shmem.c
===================================================================
--- Linux.orig/mm/shmem.c 2007-08-15 10:01:22.000000000 -0400
+++ Linux/mm/shmem.c 2007-08-16 15:03:15.000000000 -0400
@@ -971,7 +971,7 @@ static inline int shmem_parse_mpol(char
*nodelist++ = '\0';
if (nodelist_parse(nodelist, *policy_nodes))
goto out;
- if (!nodes_subset(*policy_nodes, node_online_map))
+ if (!nodes_subset(*policy_nodes, node_states[N_HIGH_MEMORY]))
goto out;
}
if (!strcmp(value, "default")) {
@@ -996,9 +996,11 @@ static inline int shmem_parse_mpol(char
err = 0;
} else if (!strcmp(value, "interleave")) {
*policy = MPOL_INTERLEAVE;
- /* Default to nodes online if no nodelist */
+ /*
+ * Default to online nodes with memory if no nodelist
+ */
if (!nodelist)
- *policy_nodes = node_online_map;
+ *policy_nodes = node_states[N_HIGH_MEMORY];
err = 0;
}
out:
@@ -1060,7 +1062,8 @@ shmem_alloc_page(gfp_t gfp, struct shmem
return page;
}
#else
-static inline int shmem_parse_mpol(char *value, int *policy, nodemask_t *policy_nodes)
+static inline int shmem_parse_mpol(char *value, int *policy,
+ nodemask_t *policy_nodes)
{
return 1;
}
@@ -2239,7 +2242,7 @@ static int shmem_fill_super(struct super
unsigned long blocks = 0;
unsigned long inodes = 0;
int policy = MPOL_DEFAULT;
- nodemask_t policy_nodes = node_online_map;
+ nodemask_t policy_nodes = node_states[N_HIGH_MEMORY];
#ifdef CONFIG_TMPFS
/*
Index: Linux/mm/page-writeback.c
===================================================================
--- Linux.orig/mm/page-writeback.c 2007-08-15 10:01:22.000000000 -0400
+++ Linux/mm/page-writeback.c 2007-08-15 10:13:49.000000000 -0400
@@ -126,7 +126,7 @@ static unsigned long highmem_dirtyable_m
int node;
unsigned long x = 0;
- for_each_online_node(node) {
+ for_each_node_state(node, N_HIGH_MEMORY) {
struct zone *z =
&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
Index: Linux/mm/swap_prefetch.c
===================================================================
--- Linux.orig/mm/swap_prefetch.c 2007-08-15 10:01:22.000000000 -0400
+++ Linux/mm/swap_prefetch.c 2007-08-15 10:13:49.000000000 -0400
@@ -249,7 +249,7 @@ static void clear_last_prefetch_free(voi
* Reset the nodes suitable for prefetching to all nodes. We could
* update the data to take into account memory hotplug if desired..
*/
- sp_stat.prefetch_nodes = node_online_map;
+ sp_stat.prefetch_nodes = node_states[N_HIGH_MEMORY];
for_each_node_mask(node, sp_stat.prefetch_nodes) {
struct node_stats *ns = &sp_stat.node[node];
@@ -261,7 +261,7 @@ static void clear_current_prefetch_free(
{
int node;
- sp_stat.prefetch_nodes = node_online_map;
+ sp_stat.prefetch_nodes = node_states[N_HIGH_MEMORY];
for_each_node_mask(node, sp_stat.prefetch_nodes) {
struct node_stats *ns = &sp_stat.node[node];
Index: Linux/mm/page_alloc.c
===================================================================
--- Linux.orig/mm/page_alloc.c 2007-08-15 10:05:41.000000000 -0400
+++ Linux/mm/page_alloc.c 2007-08-16 15:11:36.000000000 -0400
@@ -1302,7 +1302,7 @@ int zone_watermark_ok(struct zone *z, in
*
* If the zonelist cache is present in the passed in zonelist, then
* returns a pointer to the allowed node mask (either the current
- * tasks mems_allowed, or node_online_map.)
+ * tasks mems_allowed, or node_states[N_HIGH_MEMORY].)
*
* If the zonelist cache is not available for this zonelist, does
* nothing and returns NULL.
@@ -1331,7 +1331,7 @@ static nodemask_t *zlc_setup(struct zone
allowednodes = !in_interrupt() && (alloc_flags & ALLOC_CPUSET) ?
&cpuset_current_mems_allowed :
- &node_online_map;
+ &node_states[N_HIGH_MEMORY];
return allowednodes;
}
@@ -2127,7 +2127,7 @@ static int find_next_best_node(int node,
return node;
}
- for_each_online_node(n) {
+ for_each_node_state(n, N_HIGH_MEMORY) {
cpumask_t tmp;
/* Don't want a node to appear more than once */
@@ -2264,7 +2264,8 @@ static int default_zonelist_order(void)
* If there is a node whose DMA/DMA32 memory is very big area on
* local memory, NODE_ORDER may be suitable.
*/
- average_size = total_size / (num_online_nodes() + 1);
+ average_size = total_size /
+ (nodes_weight(node_states[N_HIGH_MEMORY]) + 1);
for_each_online_node(nid) {
low_kmem_size = 0;
total_size = 0;
@@ -2423,20 +2424,6 @@ static void build_zonelist_cache(pg_data
#endif /* CONFIG_NUMA */
-/* Any regular memory on that node ? */
-static void check_for_regular_memory(pg_data_t *pgdat)
-{
-#ifdef CONFIG_HIGHMEM
- enum zone_type zone_type;
-
- for (zone_type = 0; zone_type <= ZONE_NORMAL; zone_type++) {
- struct zone *zone = &pgdat->node_zones[zone_type];
- if (zone->present_pages)
- node_set_state(zone_to_nid(zone), N_NORMAL_MEMORY);
- }
-#endif
-}
-
/* return values int ....just for stop_machine_run() */
static int __build_all_zonelists(void *dummy)
{
@@ -2447,11 +2434,6 @@ static int __build_all_zonelists(void *d
build_zonelists(pgdat);
build_zonelist_cache(pgdat);
-
- /* Any memory on that node */
- if (pgdat->node_present_pages)
- node_set_state(nid, N_HIGH_MEMORY);
- check_for_regular_memory(pgdat);
}
return 0;
}
@@ -3750,14 +3732,24 @@ unsigned long __init find_max_pfn_with_a
return max_pfn;
}
+/*
+ * early_calculate_totalpages()
+ * Sum pages in active regions for movable zone.
+ * Populate N_HIGH_MEMORY for calculating usable_nodes.
+ */
static unsigned long __init early_calculate_totalpages(void)
{
int i;
unsigned long totalpages = 0;
- for (i = 0; i < nr_nodemap_entries; i++)
- totalpages += early_node_map[i].end_pfn -
+ for (i = 0; i < nr_nodemap_entries; i++) {
+ unsigned long pages = early_node_map[i].end_pfn -
early_node_map[i].start_pfn;
+ totalpages += pages;
+ if (pages)
+ node_set_state(early_node_map[i].nid,
+ N_HIGH_MEMORY);
+ }
return totalpages;
}
@@ -3773,7 +3765,8 @@ void __init find_zone_movable_pfns_for_n
int i, nid;
unsigned long usable_startpfn;
unsigned long kernelcore_node, kernelcore_remaining;
- int usable_nodes = num_online_nodes();
+ unsigned long totalpages = early_calculate_totalpages();
+ int usable_nodes = nodes_weight(node_states[N_HIGH_MEMORY]);
/*
* If movablecore was specified, calculate what size of
@@ -3784,7 +3777,6 @@ void __init find_zone_movable_pfns_for_n
* what movablecore would have allowed.
*/
if (required_movablecore) {
- unsigned long totalpages = early_calculate_totalpages();
unsigned long corepages;
/*
@@ -3809,7 +3801,7 @@ void __init find_zone_movable_pfns_for_n
restart:
/* Spread kernelcore memory as evenly as possible throughout nodes */
kernelcore_node = required_kernelcore / usable_nodes;
- for_each_online_node(nid) {
+ for_each_node_state(nid, N_HIGH_MEMORY) {
/*
* Recalculate kernelcore_node if the division per node
* now exceeds what is necessary to satisfy the requested
@@ -3901,6 +3893,20 @@ restart:
roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);
}
+/* Any regular memory on that node ? */
+static void check_for_regular_memory(pg_data_t *pgdat)
+{
+#ifdef CONFIG_HIGHMEM
+ enum zone_type zone_type;
+
+ for (zone_type = 0; zone_type <= ZONE_NORMAL; zone_type++) {
+ struct zone *zone = &pgdat->node_zones[zone_type];
+ if (zone->present_pages)
+ node_set_state(zone_to_nid(zone), N_NORMAL_MEMORY);
+ }
+#endif
+}
+
/**
* free_area_init_nodes - Initialise all pg_data_t and zone data
* @max_zone_pfn: an array of max PFNs for each zone
@@ -3978,6 +3984,11 @@ void __init free_area_init_nodes(unsigne
pg_data_t *pgdat = NODE_DATA(nid);
free_area_init_node(nid, pgdat, NULL,
find_min_pfn_for_node(nid), NULL);
+
+ /* Any memory on that node */
+ if (pgdat->node_present_pages)
+ node_set_state(nid, N_HIGH_MEMORY);
+ check_for_regular_memory(pgdat);
}
}
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2007-08-15 10:01:22.000000000 -0400
+++ Linux/mm/mempolicy.c 2007-08-16 15:03:15.000000000 -0400
@@ -130,7 +130,7 @@ static int mpol_check_policy(int mode, n
return -EINVAL;
break;
}
- return nodes_subset(*nodes, node_online_map) ? 0 : -EINVAL;
+ return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
}
/* Generate a custom zonelist for the BIND policy. */
* Re: [PATCH/RFC] memoryless nodes - fixup uses of node_online_map in generic code
2007-08-16 21:10 ` Lee Schermerhorn
@ 2007-08-16 21:13 ` Christoph Lameter
0 siblings, 0 replies; 68+ messages in thread
From: Christoph Lameter @ 2007-08-16 21:13 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: Andrew Morton, ak, linux-mm, Nishanth Aravamudan, pj, kxr,
Mel Gorman, KAMEZAWA Hiroyuki, Eric Whitney
I wonder if we could also add some /proc field to display the mask?
Something like
/proc/numainfo
Which contains
Online: <nodelist>
Possible: <nodelist>
Regular memory: <nodelist>
High memory: <nodelist>
?
That way user space can figure out what is possible on each node.
* [PATCH] 2.6.23-rc3-mm1 - Move setup of N_CPU node state mask
2007-08-08 23:40 ` Christoph Lameter
2007-08-16 14:17 ` [PATCH/RFC] memoryless nodes - fixup uses of node_online_map in generic code Lee Schermerhorn
2007-08-16 21:10 ` Lee Schermerhorn
@ 2007-08-24 16:09 ` Lee Schermerhorn
2007-09-06 13:56 ` Mel Gorman
2 siblings, 1 reply; 68+ messages in thread
From: Lee Schermerhorn @ 2007-08-24 16:09 UTC (permalink / raw)
To: linux-mm
Cc: Christoph Lameter, Andrew Morton, Nishanth Aravamudan,
Mel Gorman, KAMEZAWA Hiroyuki, Eric Whitney
Saw this while looking at "[BUG] 2.6.23-rc3-mm1 kernel BUG at
mm/page_alloc.c:2876!". Not sure it matters since, apparently, a failure
to kmalloc() the zone pcp will BUG out later anyway.
Lee
--------------------------
[PATCH] Move setup of N_CPU node state mask
Against: 2.6.23-rc3-mm1
Move the recording of nodes with cpus to before the zone loop.
Otherwise, the error-exit path could skip setting up the N_CPU mask.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
mm/page_alloc.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
Index: linux-2.6.23-rc3-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.23-rc3-mm1.orig/mm/page_alloc.c 2007-08-22 10:08:00.000000000 -0400
+++ linux-2.6.23-rc3-mm1/mm/page_alloc.c 2007-08-22 10:08:44.000000000 -0400
@@ -2793,6 +2793,8 @@ static int __cpuinit process_zones(int c
struct zone *zone, *dzone;
int node = cpu_to_node(cpu);
+ node_set_state(node, N_CPU); /* this node has a cpu */
+
for_each_zone(zone) {
if (!populated_zone(zone))
@@ -2810,7 +2812,6 @@ static int __cpuinit process_zones(int c
(zone->present_pages / percpu_pagelist_fraction));
}
- node_set_state(node, N_CPU);
return 0;
bad:
for_each_zone(dzone) {
* Re: [PATCH] 2.6.23-rc3-mm1 - Move setup of N_CPU node state mask
2007-08-24 16:09 ` [PATCH] 2.6.23-rc3-mm1 - Move setup of N_CPU node state mask Lee Schermerhorn
@ 2007-09-06 13:56 ` Mel Gorman
0 siblings, 0 replies; 68+ messages in thread
From: Mel Gorman @ 2007-09-06 13:56 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, Christoph Lameter, Andrew Morton, Nishanth Aravamudan,
Mel Gorman, KAMEZAWA Hiroyuki, Eric Whitney
On Fri, 2007-08-24 at 12:09 -0400, Lee Schermerhorn wrote:
> Saw this while looking at "[BUG] 2.6.23-rc3-mm1 kernel BUG at
> mm/page_alloc.c:2876!". Not sure it matters, as a failure to
> kmalloc() the zone pcp will apparently bug out later anyway.
>
If the failure path is entered, my expectation is that the CPU would not
appear otherwise active. I'm not convinced the old code is wrong.
> Lee
> --------------------------
>
> [PATCH] Move setup of N_CPU node state mask
>
> Against: 2.6.23-rc3-mm1
>
> Move recording of nodes w/ cpus to before zone loop.
> Otherwise, error exit could skip setup of N_CPU mask.
>
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
>
> mm/page_alloc.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> Index: linux-2.6.23-rc3-mm1/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.23-rc3-mm1.orig/mm/page_alloc.c 2007-08-22 10:08:00.000000000 -0400
> +++ linux-2.6.23-rc3-mm1/mm/page_alloc.c 2007-08-22 10:08:44.000000000 -0400
> @@ -2793,6 +2793,8 @@ static int __cpuinit process_zones(int c
> struct zone *zone, *dzone;
> int node = cpu_to_node(cpu);
>
> + node_set_state(node, N_CPU); /* this node has a cpu */
> +
> for_each_zone(zone) {
>
> if (!populated_zone(zone))
> @@ -2810,7 +2812,6 @@ static int __cpuinit process_zones(int c
> (zone->present_pages / percpu_pagelist_fraction));
> }
>
> - node_set_state(node, N_CPU);
> return 0;
> bad:
> for_each_zone(dzone) {
>
>
Thread overview: 68+ messages
2007-07-27 19:43 [PATCH 00/14] NUMA: Memoryless node support V4 Lee Schermerhorn
2007-07-27 19:43 ` [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes Lee Schermerhorn
2007-07-30 21:38 ` [PATCH/RFC] 2.6.23-rc1-mm1: MPOL_PREFERRED fixups for preferred_node < 0 Lee Schermerhorn
2007-07-30 22:00 ` Lee Schermerhorn
2007-07-31 15:32 ` Mel Gorman
2007-07-31 15:58 ` Lee Schermerhorn
2007-07-31 21:05 ` [PATCH/RFC] 2.6.23-rc1-mm1: MPOL_PREFERRED fixups for preferred_node < 0 - v2 Lee Schermerhorn
2007-08-01 2:22 ` [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes Andrew Morton
2007-08-01 2:52 ` Christoph Lameter
2007-08-01 3:05 ` Andrew Morton
2007-08-01 3:14 ` Christoph Lameter
2007-08-01 3:32 ` Andrew Morton
2007-08-01 3:37 ` Christoph Lameter
[not found] ` <Pine.LNX.4.64.0707312151400.2894@schroedinger.engr.sgi.com>
2007-08-01 5:07 ` Andrew Morton
2007-08-01 5:11 ` Andrew Morton
2007-08-01 5:22 ` Christoph Lameter
2007-08-01 10:24 ` Mel Gorman
2007-08-02 16:23 ` Mel Gorman
2007-08-02 20:00 ` Christoph Lameter
2007-08-01 5:36 ` Paul Mundt
2007-08-01 9:19 ` Andi Kleen
2007-08-01 14:03 ` Lee Schermerhorn
2007-08-01 17:41 ` Christoph Lameter
2007-08-01 17:54 ` Lee Schermerhorn
2007-08-02 20:05 ` [PATCH/RFC/WIP] cpuset-independent interleave policy Lee Schermerhorn
2007-08-02 20:34 ` Christoph Lameter
2007-08-02 21:04 ` Lee Schermerhorn
2007-08-03 0:31 ` Christoph Lameter
2007-08-02 20:19 ` Audit of "all uses of node_online()" Lee Schermerhorn
2007-08-02 20:26 ` Christoph Lameter
2007-08-08 22:19 ` Lee Schermerhorn
2007-08-08 23:40 ` Christoph Lameter
2007-08-16 14:17 ` [PATCH/RFC] memoryless nodes - fixup uses of node_online_map in generic code Lee Schermerhorn
2007-08-16 18:33 ` Christoph Lameter
2007-08-16 19:15 ` Lee Schermerhorn
2007-08-16 21:10 ` Lee Schermerhorn
2007-08-16 21:13 ` Christoph Lameter
2007-08-24 16:09 ` [PATCH] 2.6.23-rc3-mm1 - Move setup of N_CPU node state mask Lee Schermerhorn
2007-09-06 13:56 ` Mel Gorman
2007-08-02 20:33 ` Audit of "all uses of node_online()" Andrew Morton
2007-08-02 20:45 ` Lee Schermerhorn
2007-08-01 15:58 ` [PATCH 01/14] NUMA: Generic management of nodemasks for various purposes Nishanth Aravamudan
2007-08-01 16:09 ` Nishanth Aravamudan
2007-08-01 17:47 ` Christoph Lameter
2007-08-01 15:25 ` Nishanth Aravamudan
2007-07-27 19:43 ` [PATCH 02/14] Memoryless nodes: introduce mask of nodes with memory Lee Schermerhorn
2007-07-27 19:43 ` [PATCH 03/14] Memoryless Nodes: Fix interleave behavior Lee Schermerhorn
2007-07-27 19:43 ` [PATCH 04/14] OOM: use the N_MEMORY map instead of constructing one on the fly Lee Schermerhorn
2007-07-27 19:43 ` [PATCH 05/14] Memoryless Nodes: No need for kswapd Lee Schermerhorn
2007-07-27 19:43 ` [PATCH 06/14] Memoryless Node: Slab support Lee Schermerhorn
2007-07-27 19:44 ` [PATCH 07/14] Memoryless nodes: SLUB support Lee Schermerhorn
2007-07-27 19:44 ` [PATCH 08/14] Uncached allocator: Handle memoryless nodes Lee Schermerhorn
2007-07-27 19:44 ` [PATCH 09/14] Memoryless node: Allow profiling data to fall back to other nodes Lee Schermerhorn
2007-07-27 19:44 ` [PATCH 10/14] Memoryless nodes: Update memory policy and page migration Lee Schermerhorn
2007-07-27 19:44 ` [PATCH 11/14] Add N_CPU node state Lee Schermerhorn
2007-07-27 19:44 ` [PATCH 12/14] Memoryless nodes: Fix GFP_THISNODE behavior Lee Schermerhorn
2007-07-27 19:44 ` [PATCH 13/14] Memoryless Nodes: use "node_memory_map" for cpusets Lee Schermerhorn
2007-07-27 19:44 ` [PATCH 14/14] Memoryless nodes: drop one memoryless node boot warning Lee Schermerhorn
2007-07-27 20:59 ` [PATCH 00/14] NUMA: Memoryless node support V4 Nishanth Aravamudan
2007-07-30 13:48 ` Lee Schermerhorn
2007-07-29 12:35 ` Paul Jackson
2007-07-30 16:07 ` Lee Schermerhorn
2007-07-30 18:56 ` Paul Jackson
2007-07-30 21:19 ` Nishanth Aravamudan
2007-07-30 22:06 ` Christoph Lameter
2007-07-30 22:35 ` Andi Kleen
2007-07-30 22:36 ` Christoph Lameter
2007-07-31 23:18 ` Nishanth Aravamudan